I can't search through a PDF file

I have a problem with searching in a PDF file.

I found out that the problem is caused by the wrong encoding. I read on a forum that nothing can be done about it unless I ask the PDF file's creator to use the correct encoding, but that is not possible for me.

As you can see in this image, I am searching for a word that I know for sure is in the document, but the search can't find it.

The rectangular characters are Hungarian accented letters in the PDF document.

What I did was export the whole PDF to image files (JPEG 2000, JPG, or TIFF), recombine all the pages into a single PDF file, and run OCR on it. The result was searchable, but the file became too large, and at a lower resolution it lost too much detail to be usable.

Does anyone know a way to produce a good-quality searchable PDF file?
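To put the size/resolution trade-off of the image-and-OCR route in rough numbers, here is a minimal sketch that estimates the uncompressed size of one rasterised page. The A4 page size, the 8-bit grayscale depth, and the function name `raster_megabytes` are assumptions made for illustration; real files are compressed, so actual sizes will be smaller, but they scale the same way with resolution.

```python
def raster_megabytes(width_in: float, height_in: float, dpi: int,
                     bytes_per_pixel: int = 1) -> float:
    """Uncompressed size of one rasterised page, in megabytes (10**6 bytes)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000


# Assumed page size: A4, roughly 8.27 x 11.69 inches.
A4_INCHES = (8.27, 11.69)

if __name__ == "__main__":
    for dpi in (150, 300, 600):
        size = raster_megabytes(*A4_INCHES, dpi)
        print(f"{dpi} dpi: about {size:.1f} MB per page before compression")
```

The pixel count grows with the square of the resolution, so halving the DPI cuts the raw data to a quarter, which is why lowering the resolution shrinks the file quickly but also costs detail.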

edited Nov 20 '18 at 14:54

mkl

54.6k1168147

asked Nov 20 '18 at 13:18

Levente Bartos

215

But you say you cannot ask its author for a better PDF, so why your last question? If you post a link to that file we could at least verify that your forum advisors are correct.

– usr2564301
Nov 20 '18 at 18:35

That forum topic was not related to my problem. I could maybe ask, but I would like to find out whether there is a general way to solve this kind of problem. I have found PDF files in the past that were corrupt and had the same problem. As I mentioned, exporting the PDF, recombining the exported files, and then running OCR is a solution, but it comes with a bigger file size and actually lower resolution. I am wondering whether it is possible to get the data straight from the PDF file, or whether there is a simpler solution.

– Levente Bartos
Nov 20 '18 at 19:24

There are many, many different ways to create a perfectly correct and viewable PDF. Its text may or may not be extractable. That is why I'd like to see the file. If you cannot, for example, simply select the text, then it may not be "text". If you can, and you can copy it with a good PDF viewer, then it may be possible to mechanically extract. But we don't know what will work without seeing the file.

– usr2564301
Nov 20 '18 at 20:16

Here is the misbehaving PDF file; it is about computer science, in Hungarian: cs.bme.hu/~fleiner/jegyzet/NESZ.pdf

– Levente Bartos
Nov 21 '18 at 11:17

The encoding is actually correct, but the accent U+0301 is in the wrong position! Gr{U+0301}afok – it should come after the a. This is most likely Acrobat's problem. No idea how to fix it. It may be off-topic for Stack Overflow after all.

– usr2564301
Nov 21 '18 at 11:25
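If the text can be extracted from the PDF at all, the misplaced accent described in the comment above could in principle be repaired on the extracted text before searching it. The following is only a sketch under a strong assumption: that every combining mark in the input consistently precedes the vowel it belongs to, as in Gr{U+0301}afok. The function name `repair_leading_accents` and the vowel set are made up for this illustration; it does not modify the PDF itself.

```python
import unicodedata

# Assumption of this sketch: in Hungarian only vowels carry accents.
ACCENTABLE_VOWELS = set("aeiouAEIOU")


def repair_leading_accents(text: str) -> str:
    """Move a combining mark that precedes a vowel to just after it, then compose."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        is_mark = unicodedata.combining(chars[i]) != 0
        if is_mark and chars[i + 1] in ACCENTABLE_VOWELS:
            # Swap "accent, vowel" into the order Unicode expects: "vowel, accent".
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just fixed
        else:
            i += 1
    # NFC merges a base letter and its combining mark into one precomposed character.
    return unicodedata.normalize("NFC", "".join(chars))


if __name__ == "__main__":
    extracted = "Gr\u0301afok"  # accent stored before the "a", as in the comment
    repaired = repair_leading_accents(extracted)
    print(repaired)
    print("gr\u00e1f" in repaired.lower())  # search with a normally typed query
```

Text that is already in the correct order would be damaged by this swap (an accent followed by another vowel would be moved onto the wrong letter), so it only makes sense for files known to show this specific defect.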

pdf full-text-search ocr encode
