How to successfully convert math papers to plain text
Goals:
1. Develop a canonical way to represent STEM papers in general, and math papers in particular, uniquely in plain text.
2. Develop software that can convert existing typeset STEM papers into that canonical form with 100% accuracy. I can't tolerate any inaccuracy: as a single individual I can't proofread millions of papers to correct conversion errors, even at an average rate of 0.001 errors per paper.
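To make the scale concrete, here is the expected number of conversion errors at that rate, assuming purely for illustration a corpus of N = 10^6 papers ("millions" could be several times larger) and an error rate r of 0.001 errors per paper:

```latex
\[
  E = N \cdot r = 10^{6} \times 10^{-3} = 10^{3}
\]
```

So even at that rate there would be about a thousand errors spread over the corpus, with no cheap way to tell which papers contain them.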
Problems:
1. None of the PDF-to-text or TeX-to-text programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, really work, because they cannot process math symbols.
2. PDF is really hard to process.
3. TeX is really hard to process because of the numerous macros that STEM authors add to their source files, which tend to break LaTeXML and other converters. My own papers are easy to process because I don't define many new commands. However, many authors' papers contain \def macros that cannot even be processed by de-macro. To actually get TeX to work, assuming I can even get the source files of most papers on arXiv, I would pretty much have to write my own variant of a TeX engine that expands all the required macros and produces a plain-text document.
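To illustrate the macro problem in item 3, here is a minimal sketch of the easy case only: expanding argument-free \def and \newcommand definitions with regular expressions. The function name expand_simple_macros and the regex are my own illustration, not how de-macro or LaTeXML work. Macros with arguments, deeply nested braces, conditionals and catcode changes are deliberately left untouched, and those are exactly the cases that need a real TeX engine.

```python
import re

# Matches only the two simplest, argument-free forms:
#   \def\R{\mathbb{R}}    and    \newcommand{\R}{\mathbb{R}}
# The body may contain at most one level of nested braces.
DEF_RE = re.compile(
    r"\\def\s*(\\[A-Za-z]+)\s*\{((?:[^{}]|\{[^{}]*\})*)\}"
    r"|\\newcommand\s*\{(\\[A-Za-z]+)\}\s*\{((?:[^{}]|\{[^{}]*\})*)\}"
)


def expand_simple_macros(source: str, max_passes: int = 10) -> str:
    r"""Expand argument-free \def and \newcommand macros (hypothetical helper)."""
    macros = {}
    for m in DEF_RE.finditer(source):
        if m.group(1):
            macros[m.group(1)] = m.group(2)
        else:
            macros[m.group(3)] = m.group(4)
    # Drop the definitions themselves from the output.
    text = DEF_RE.sub("", source)
    # Repeat so that macros defined in terms of other macros also expand;
    # max_passes guards against self-referential definitions.
    for _ in range(max_passes):
        previous = text
        for name, body in macros.items():
            # (?![A-Za-z]) keeps \R from matching inside \Re or \RR.
            pattern = re.escape(name) + r"(?![A-Za-z])"
            text = re.sub(pattern, lambda _m, b=body: b, text)
        if text == previous:
            break
    return text
```

For example, given the input \def\R{\mathbb{R}}\newcommand{\RR}{\R^2} Let $x \in \RR$ and $\Re z > 0$. the sketch yields Let $x \in \mathbb{R}^2$ and $\Re z > 0$. (plus a leading space), leaving the built-in \Re alone. Definitions such as \def\foo#1{...} do not match the pattern and pass through unchanged.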
Is there any other way to solve this problem? The target format I currently prefer is essentially plain text plus math symbols written in LaTeX, with no formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I could learn to set up and train a neural network to recognize these printed math symbols, assuming my laptop is powerful enough. There are fewer than 200 symbols for the network to learn, and their shapes should be easy to recognize because they vary so little. Should I do that?
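As a rough sketch of what I mean by that target format, the function below drops a few commands that I assume are presentation-only while keeping semantically significant ones such as \mathcal. The name normalize_math and the two command lists are illustrative and far from complete.

```python
import re

# Control words assumed here to be presentation-only (illustrative list).
PRESENTATION_WORDS = [
    r"\displaystyle", r"\textstyle", r"\scriptstyle", r"\qquad", r"\quad",
]
# Spacing control symbols \, \; \: \! when not preceded by another backslash,
# so that the ",": after a line break "\\" is not eaten by mistake.
SPACING_SYMBOLS = re.compile(r"(?<!\\)\\[,;:!]")


def normalize_math(tex: str) -> str:
    r"""Drop presentation-only commands; keep semantic ones like \mathcal."""
    for cmd in PRESENTATION_WORDS:
        # A control word must not be followed by another letter.
        tex = re.sub(re.escape(cmd) + r"(?![A-Za-z])\s*", "", tex)
    tex = SPACING_SYMBOLS.sub("", tex)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", tex).strip()
```

With this sketch, \displaystyle \mathcal{A} \, \subset \quad A becomes \mathcal{A} \subset A, so the two letters stay distinct while the spacing and style commands disappear.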
pdf latex ps mathml
This is an extremely broad request that requires extremely accurate results without any examples of what to work with in general.
– Werner
Nov 21 '18 at 1:51
@Werner Sure. My goal is to convert the text in a random paper such as arxiv.org/abs/1802.00001 to plain text while retaining all the semantically significant information.
– Ying Zhou
Nov 21 '18 at 18:40
asked Nov 20 '18 at 4:04 (edited Nov 20 '18 at 4:12)
Ying Zhou
1 Answer
Yes, you can try that: recognize the symbols, then transform them into LaTeX (for example, write \sqrt for every square root).
For more on the recognition problem, see this paper:
Torfinn Taxt, Jórunn B. Ólafsdóttir, Morten Dæhlen, "Recognition of handwritten symbols": https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y
You can find out more, with code samples, about applying neural networks to handwriting recognition here: http://neuralnetworksanddeeplearning.com/chap1.html
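The recognize-then-map pipeline can be sketched in a few lines. To keep the example self-contained, it uses nearest-template matching by Hamming distance as a stand-in for the neural network classifier, and the 3x3 bitmaps in TEMPLATES are made-up toy glyphs rather than real font data. A real system would also have to segment the page and recover two-dimensional structure such as superscripts and fractions, which this sketch ignores.

```python
# Toy 3x3 binary glyph templates keyed by LaTeX token (hypothetical data;
# real glyphs would be rendered from the papers' fonts at higher resolution).
TEMPLATES = {
    "+": ("010", "111", "010"),
    "-": ("000", "111", "000"),
    r"\times": ("101", "010", "101"),
}


def hamming(a, b):
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return the LaTeX token whose template is closest to the glyph."""
    return min(TEMPLATES, key=lambda token: hamming(TEMPLATES[token], glyph))


def to_latex(glyphs):
    """Map a left-to-right sequence of glyph bitmaps to a LaTeX string."""
    return " ".join(recognize(g) for g in glyphs)
```

A glyph that differs from the \times template by one pixel, such as ("101", "010", "100"), is still mapped to \times. With only a few fonts to cover, the classifier in this position could be swapped for a trained network without changing the mapping step.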
Thanks a lot! I don't even need to process handwritten symbols, just typical arXiv papers, which use a very limited range of fonts.
– Ying Zhou
Nov 20 '18 at 16:30
Oh, that makes the problem even easier!
– Farid Hasanov
Nov 21 '18 at 8:03
With handwritten symbols, every symbol is subject to human error and inaccuracy in writing. You seem to be spared that problem.
– Farid Hasanov
Nov 21 '18 at 8:04
answered Nov 20 '18 at 10:56
Farid Hasanov