How to successfully convert math papers to plain text


Goals:

1. Develop a canonical way to represent STEM papers in general, and math papers in particular, uniquely as plain text.

2. Develop software that can convert existing typeset STEM papers into that canonical form with 100% accuracy. I can't tolerate any inaccuracy: as a single individual I can't proofread millions of papers to correct conversion errors, even at an average rate of 0.001 errors per paper.

Problems:

1. All the PDF-to-text, TeX-to-text, etc. programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work, because they cannot process math symbols.

2. PDF is really hard to process.

3. TeX is really hard to process because of the numerous macros that STEM authors add to their source files, which tend to break LaTeXML and other converters. My own papers are easy to process because I don't define many new commands. However, many authors' papers contain \def macros that even de-macro cannot process. To get TeX to work, assuming I can get the source files of most arXiv papers at all, I would pretty much have to write my own variant of a TeX engine that expands all the required macros and produces a plain-text document.
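To make the macro problem concrete, here is a minimal Python sketch of the easy case only: expanding argument-free \newcommand and \def definitions with regular expressions. The function name expand_simple_macros and the patterns are illustrative assumptions, not part of de-macro or LaTeXML. Macros with arguments, \let, conditionals, and catcode changes are left untouched, and those are exactly the cases that would need something close to a real TeX engine.

```python
import re

# A brace-delimited body, allowing one level of nested braces.
BODY = r"\{((?:[^{}]|\{[^{}]*\})*)\}"
# Argument-free definitions only: \newcommand{\R}{...} or \def\R{...}.
NEWCOMMAND = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?(\\[A-Za-z]+)\}?\s*" + BODY
)
DEF = re.compile(r"\\def\s*(\\[A-Za-z]+)\s*" + BODY)


def expand_simple_macros(tex: str, max_passes: int = 10) -> str:
    """Expand argument-free macros; leave everything else untouched."""
    macros = {}
    for pattern in (NEWCOMMAND, DEF):
        for match in pattern.finditer(tex):
            macros[match.group(1)] = match.group(2)
        # Drop the definitions that were just collected.
        tex = pattern.sub("", tex)
    # Repeat so that macros defined in terms of other macros also expand.
    for _ in range(max_passes):
        previous = tex
        for name, body in macros.items():
            # (?![A-Za-z]) stops \R from matching inside \Re.
            tex = re.sub(
                re.escape(name) + r"(?![A-Za-z])",
                lambda m, b=body: b,
                tex,
            )
        if tex == previous:
            break
    return tex


if __name__ == "__main__":
    source = (
        "\\newcommand{\\R}{\\mathbb{R}}\n"
        "\\def\\eps{\\varepsilon}\n"
        "Let $x \\in \\R$, $\\eps > 0$, $\\Re z$."
    )
    print(expand_simple_macros(source).strip())
```

A definition such as \newcommand{\norm}[1]{...} does not match either pattern, so it and its uses pass through unchanged instead of being expanded incorrectly.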

Is there any other way to solve this problem? Currently my preferred target format is plain text plus math symbols written in LaTeX, with no formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I could learn to set up and train a neural network to recognize these printed math symbols, assuming my laptop is sufficiently powerful. There are fewer than 200 symbols for the network to learn, and their shapes should be easy to recognize because they vary so little. Should I do that?

edited Nov 20 '18 at 4:12

asked Nov 20 '18 at 4:04

Ying Zhou


This is an extremely broad request that requires extremely accurate results without any examples of what to work with in general.

– Werner
Nov 21 '18 at 1:51

@Werner Sure. My goal is to convert the text in a random paper such as arxiv.org/abs/1802.00001 to plain text while retaining all the semantically significant information.

– Ying Zhou
Nov 21 '18 at 18:40


pdf latex ps mathml


1 Answer

Yes, you can try that: recognize the symbols, then transform them into LaTeX (for example, write \sqrt for every square root).

For more on the recognition problem, see this paper:

Torfinn Taxt, Jórunn B. Ólafsdóttir, Morten Dæhlen, "Recognition of handwritten symbols": https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y

http://neuralnetworksanddeeplearning.com/chap1.html has more, with code samples, on implementing a neural network that recognizes handwriting.
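The second step, turning recognized symbols into LaTeX, could look roughly like the Python sketch below. It assumes the recognizer already outputs a flat sequence of Unicode characters; the table SYMBOL_TO_LATEX and the function to_latex are hypothetical names made up for illustration, and the table is only a tiny sample. Two-dimensional structure (superscripts, fractions, what a square root covers) is not handled here and would need a separate layout-analysis step.

```python
# Hypothetical lookup table from recognized symbols to LaTeX commands.
SYMBOL_TO_LATEX = {
    "√": r"\sqrt",
    "∑": r"\sum",
    "∈": r"\in",
    "≤": r"\leq",
    "α": r"\alpha",
    "𝒜": r"\mathcal{A}",  # kept distinct from a plain "A"
}


def to_latex(symbols):
    """Join recognized symbols into a LaTeX string.

    ASCII symbols pass through unchanged; a non-ASCII symbol without
    a table entry raises KeyError instead of being silently guessed.
    """
    parts = []
    for symbol in symbols:
        if symbol in SYMBOL_TO_LATEX:
            parts.append(SYMBOL_TO_LATEX[symbol])
        elif symbol.isascii():
            parts.append(symbol)
        else:
            raise KeyError(f"no LaTeX mapping for {symbol!r}")
    return " ".join(parts)


if __name__ == "__main__":
    print(to_latex(["α", "∈", "𝒜"]))
```

Raising on an unmapped symbol, rather than emitting a best guess, fits the requirement that conversion errors must not slip through unnoticed.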

answered Nov 20 '18 at 10:56

Farid Hasanov

Thanks a lot! I don't even need to process handwritten symbols, just typical papers on arXiv, which usually use a very limited range of fonts.

– Ying Zhou
Nov 20 '18 at 16:30

Oh, that makes the problem even easier!

– Farid Hasanov
Nov 21 '18 at 8:03

With handwritten symbols, every symbol is subject to human error and inaccuracy in writing. You are spared that source of error.

– Farid Hasanov
Nov 21 '18 at 8:04


