How to successfully convert math papers to plain text












Goals:



1. Develop a canonical way to represent STEM papers in general, and math papers in particular, uniquely in plain text.

2. Develop software that can convert existing typeset STEM papers into that canonical form with 100% accuracy. I can't tolerate any inaccuracy because, as a single individual, I can't proofread millions of papers to correct conversion errors, even at an average rate of 0.001 errors per paper.


Problems:




1. All the PDF-to-text and TeX-to-text programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work because they cannot process math symbols.

2. PDF is really hard to process.

3. TeX is really hard to process because of the numerous macros that STEM paper authors tend to add to their source files, which tend to break LaTeXML and other converters. My own papers are very easy to process because I don't define many new commands. However, many authors' papers contain \def macros that cannot even be processed by de-macro. To actually get TeX to work, assuming that I can even get the source files of most papers on arXiv at all, I would pretty much have to write my own variant of a TeX engine that somehow expands all the required macros and produces a plain-text document.
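To make problem 3 concrete, here is a minimal Python sketch of the easy case only: expanding zero-argument \newcommand and \def macros by text substitution. The function name expand_simple_macros and the regular expression are illustrative assumptions, not part of de-macro or LaTeXML, and the sketch deliberately ignores macros with arguments, \let, conditionals, and everything else a real TeX engine handles.

```python
import re

# Matches zero-argument definitions only (an assumption of this sketch):
#   \newcommand{\R}{\mathbb{R}}   or   \def\R{\mathbb{R}}
# The body may contain braces nested one level deep.
DEF_RE = re.compile(
    r"\\(?:newcommand\{(\\[A-Za-z]+)\}|def(\\[A-Za-z]+))"
    r"\{((?:[^{}]|\{[^{}]*\})*)\}"
)

def expand_simple_macros(tex: str) -> str:
    """Expand zero-argument macros; leave every other definition untouched."""
    macros = {}
    for m in DEF_RE.finditer(tex):
        name = m.group(1) or m.group(2)
        macros[name] = m.group(3)
    # Remove the definitions that were understood.
    body = DEF_RE.sub("", tex)
    # Bounded number of passes so a self-referential macro cannot loop forever.
    for _ in range(10):
        changed = False
        for name, repl in macros.items():
            # The lookahead stops \R from matching inside \Rightarrow.
            pattern = re.compile(re.escape(name) + r"(?![A-Za-z])")
            new_body = pattern.sub(lambda _m, r=repl: r, body)
            if new_body != body:
                body, changed = new_body, True
        if not changed:
            break
    return body.strip()
```

A definition with parameters, such as \def\pair#1#2{(#1,#2)}, does not match the pattern and passes through unchanged, which is exactly the kind of macro that defeats simple text substitution.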



Is there any other way to solve this problem? Currently the target format I prefer is pretty much just plain text plus math symbols written in LaTeX, without any formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I can learn to set up and train a neural network to recognize these printed math symbols, assuming that my laptop is sufficiently powerful. There are literally fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize because they vary so little. Should I do that?










  • This is an extremely broad request that requires extremely accurate results without any examples of what to work with in general.

    – Werner
    Nov 21 '18 at 1:51











  • @Werner Sure. My goal is to convert the text in a random paper such as arxiv.org/abs/1802.00001 to plain text while retaining all the semantically significant information.

    – Ying Zhou
    Nov 21 '18 at 18:40
















pdf latex ps mathml














asked Nov 20 '18 at 4:04 by Ying Zhou
edited Nov 20 '18 at 4:12 by Ying Zhou













1 Answer














Yes, you can try that: recognize the symbols, then transform them into LaTeX (for example, write \sqrt for every square root).
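The second step, turning recognized symbols into LaTeX, can be sketched as a lookup table. This is a minimal sketch in Python, assuming a recognizer that emits one label per symbol in reading order; the label names and the function labels_to_latex are hypothetical, and the sketch ignores two-dimensional layout such as what sits under a radical or in a superscript.

```python
# Hypothetical recognizer labels mapped to LaTeX commands.
# A real recognizer would define its own label set.
LABEL_TO_LATEX = {
    "sqrt": r"\sqrt",
    "alpha": r"\alpha",
    "calA": r"\mathcal{A}",
    "leq": r"\leq",
    "infty": r"\infty",
}

def labels_to_latex(labels):
    """Map each recognized label to LaTeX; unknown labels pass through as-is."""
    return " ".join(LABEL_TO_LATEX.get(label, label) for label in labels)
```

For example, labels_to_latex(["calA", "leq", "sqrt", "{", "x", "}"]) yields a string that starts with \mathcal{A}, which keeps the calligraphic A distinct from a plain A as the question requires.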



For more on the recognition problem, see this paper:

"Recognition of handwritten symbols" by Torfinn Taxt, Jórunn B. Ólafsdóttir, and Morten Dæhlen:

https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y



http://neuralnetworksanddeeplearning.com/chap1.html explains, with code samples, how to implement a neural network that recognizes handwritten digits.






  • Thanks a lot! I don't even need to process handwritten symbols, just typical papers on arXiv, which usually use a very limited range of fonts.

    – Ying Zhou
    Nov 20 '18 at 16:30













  • Oh, that makes the problem even easier!

    – Farid Hasanov
    Nov 21 '18 at 8:03











  • With handwritten symbols, every symbol is subject to human error and inaccuracy in writing. You seem to be spared that problem.

    – Farid Hasanov
    Nov 21 '18 at 8:04
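As the comments note, printed glyphs from a small set of fonts vary far less than handwriting. One way to see why that helps is nearest-template matching, which is a simpler technique than the neural network discussed above and is swapped in here only for illustration. This is a minimal sketch in Python; the 3x3 bitmaps, the TEMPLATES table, and the function names are toy assumptions, not data from any real font.

```python
# Toy 3x3 bitmaps ("1" = ink) standing in for rendered glyphs.
# These templates are hypothetical; real ones would be rasterized from fonts.
TEMPLATES = {
    "+": ["010", "111", "010"],
    "-": ["000", "111", "000"],
    "|": ["010", "010", "010"],
}

def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        x != y
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )

def classify(glyph):
    """Return the label of the template closest to the glyph."""
    return min(TEMPLATES, key=lambda label: hamming(TEMPLATES[label], glyph))
```

With so little variation between instances of the same printed symbol, a clean glyph matches its template exactly and a slightly noisy one still lands nearest to it. This does not by itself give the 100% accuracy the question asks for, since visually similar symbols can still be confused.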












answered Nov 20 '18 at 10:56 by Farid Hasanov












