I can't search through a PDF file
I have a problem with searching in a PDF file.
I found out that the problem is caused by a wrong encoding. I read on a forum that there is nothing I can do about it unless I contact the PDF file's creator and ask them to use the correct encoding, but that is not possible for me.
As you can see in this image, I am searching for a word that I know for sure is contained in the document, but the search can't find it.
The rectangular characters are Hungarian accented letters in the PDF document.
What I did was export the whole PDF to image files (JPEG 2000, JPG, or TIFF), recombine all the pages into a single PDF file, and run OCR. With this approach the file became searchable, but it also became too large; at a lower resolution it lost too much detail to be usable.
Does anyone know a way to produce a good-quality searchable PDF file?
pdf full-text-search ocr encode
But you say you cannot ask its author for a better PDF, so why your last question? If you post a link to that file we could at least verify that your forum advisors are correct.
– usr2564301
Nov 20 '18 at 18:35
That forum topic was not related to my problem. I could maybe ask, but I would like to find out whether there is a general way to solve this kind of problem. I have found PDF files in the past that were corrupt and had the same problem. As I mentioned, exporting the PDF, recombining the exported files, and then running OCR is a solution, but it comes with a bigger file size and a lower resolution. I am wondering whether it is possible to get the data straight from the PDF file, or whether there is a simpler solution.
– Levente Bartos
Nov 20 '18 at 19:24
There are many, many different ways to create a perfectly correct and viewable PDF. Its text may or may not be extractable. That is why I'd like to see the file. If you cannot, for example, simply select the text, then it may not be "text". If you can, and you can copy it with a good PDF viewer, then it may be possible to mechanically extract. But we don't know what will work without seeing the file.
– usr2564301
Nov 20 '18 at 20:16
Here is the misbehaving PDF file; it is a computing science text in Hungarian: cs.bme.hu/~fleiner/jegyzet/NESZ.pdf
– Levente Bartos
Nov 21 '18 at 11:17
The encoding is actually correct, but the accent U+0301 is in the wrong position! Gr{U+0301}afok – it should be behind the a. This is most likely Acrobat's problem. No idea how to fix. It may be off-topic for Stack Overflow after all.
– usr2564301
Nov 21 '18 at 11:25
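To make the misplaced accent concrete, here is a minimal Python sketch that repairs text after it has been extracted from such a PDF. It does not touch the PDF itself or fix searching inside a viewer. It assumes the only defect is the one described above: the combining acute accent U+0301 comes immediately before the vowel it belongs to instead of after it. The function name `fix_misplaced_acute` is hypothetical, and other combining marks are deliberately left alone.

```python
import re
import unicodedata

# U+0301 (combining acute accent) directly before a vowel and not directly
# after one, i.e. the broken order seen in "Gr\u0301afok".
_MISPLACED_ACUTE = re.compile("(?<![aeiouAEIOU])\u0301([aeiouAEIOU])")


def fix_misplaced_acute(text: str) -> str:
    """Move U+0301 behind the vowel it precedes, then compose to NFC.

    Assumption: the accent is exactly one position too early. Text that is
    already in the correct order (vowel, then U+0301) is not swapped.
    Limitation: a misplaced accent that sits between two vowels is
    ambiguous and is left where it is.
    """
    swapped = _MISPLACED_ACUTE.sub(lambda m: m.group(1) + "\u0301", text)
    # NFC merges "a" + U+0301 into the single code point U+00E1.
    return unicodedata.normalize("NFC", swapped)


if __name__ == "__main__":
    print(fix_misplaced_acute("Gr\u0301afok"))
```

Searching the repaired text for the precomposed word should then match, provided the search term is also normalized to NFC.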
asked Nov 20 '18 at 13:18 by Levente Bartos
edited Nov 20 '18 at 14:54 by mkl