I can't search through a PDF file












I have a problem searching in a PDF file.

I found out that the problem is caused by the wrong encoding. I read on a forum that there is nothing I can do about it unless I contact the PDF file's creator and ask them to use the correct encoding, but that is not possible for me.

In the screenshot, I am searching for a word that I know for sure is in the document, but the search can't find it.

The rectangular characters are the Hungarian accented letters in the PDF document.

What I did was export the whole PDF to image files (JPEG 2000, JPG, or TIFF), recombine all the pages into a single PDF file, and run OCR. With this approach the file became searchable, but it was too large; at a lower resolution it lost too much detail to be usable.

Does anyone know a way to produce a good-quality searchable PDF file?
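
For a rough sense of why the rasterize-and-OCR route trades file size against detail, here is a back-of-the-envelope sketch. It assumes A4 pages and uncompressed 24-bit RGB, neither of which is stated in the question. Real exports are compressed, so actual sizes will be smaller, but the scaling with resolution is the same.

```python
# Assumption (not from the question): A4 pages, 24-bit RGB, no compression.
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # page width and height in millimetres


def raster_size(dpi: int) -> tuple[int, int, int]:
    """Return (width_px, height_px, raw_bytes) for one A4 page at the given DPI."""
    width = round(A4_MM[0] / MM_PER_INCH * dpi)
    height = round(A4_MM[1] / MM_PER_INCH * dpi)
    return width, height, width * height * 3  # 3 bytes per RGB pixel


for dpi in (150, 300, 600):
    w, h, raw = raster_size(dpi)
    print(f"{dpi} dpi: {w} x {h} px, {raw / 2**20:.1f} MiB raw per page")
```

Halving the resolution cuts the pixel count to roughly a quarter, which is why the smaller files lose so much detail.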










  • But you say you cannot ask its author for a better PDF, so why your last question? If you post a link to that file we could at least verify that your forum advisors are correct.

    – usr2564301
    Nov 20 '18 at 18:35











  • That forum topic was not related to my problem. I could maybe ask, but I would like to find out whether there is a general way to solve this kind of problem. I have found PDF files in the past that were corrupt and had the same problem. As I mentioned, exporting the PDF, recombining the exported files, and then running OCR is a solution, but it comes with a bigger file size and lower resolution. I am wondering whether it is possible to get the data straight from the PDF file, or whether there is a simpler solution.

    – Levente Bartos
    Nov 20 '18 at 19:24











  • There are many, many different ways to create a perfectly correct and viewable PDF. Its text may or may not be extractable. That is why I'd like to see the file. If you cannot, for example, simply select the text, then it may not be "text". If you can, and you can copy it with a good PDF viewer, then it may be possible to mechanically extract. But we don't know what will work without seeing the file.

    – usr2564301
    Nov 20 '18 at 20:16






  • Here is the misbehaving PDF file; it is about computer science, in Hungarian: cs.bme.hu/~fleiner/jegyzet/NESZ.pdf

    – Levente Bartos
    Nov 21 '18 at 11:17











  • The encoding is actually correct, but the accent U+0301 is in the wrong position! Gr{U+0301}afok – it should be behind the a. This is most likely Acrobat's problem. No idea how to fix. It may be off-topic for Stack Overflow after all.

    – usr2564301
    Nov 21 '18 at 11:25
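
To illustrate the misplaced accent described in the comment above, here is a minimal Python sketch. It assumes the text has already been extracted from the PDF and that every combining mark was emitted one position too early (before its base letter instead of after it). The helper name `fix_leading_accents` is made up for this example. It only repairs extracted text, for instance to build an external search index; it does not change how a PDF viewer searches the file.

```python
import unicodedata


def fix_leading_accents(text: str) -> str:
    """Move each combining mark behind the character that follows it, then compose.

    Assumption: ALL combining marks in `text` are misplaced this way.
    Correctly ordered input would be damaged by this function.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if unicodedata.combining(ch) and nxt and not unicodedata.combining(nxt):
            out.append(nxt + ch)  # swap: base letter first, accent second
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC turns "a" + U+0301 into the single code point U+00E1
    return unicodedata.normalize("NFC", "".join(out))


extracted = "Gr\u0301afok"  # the order reported in the comment
print("Gr\u00e1fok" in extracted)                       # False: search misses it
print(fix_leading_accents(extracted) == "Gr\u00e1fok")  # True after the repair
```

The swap is deliberately naive: it is a sketch of the idea, not a general repair tool for broken PDFs.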
















pdf full-text-search ocr encode






edited Nov 20 '18 at 14:54 by mkl

asked Nov 20 '18 at 13:18 by Levente Bartos












