Read highlighted text from a picture using tesseract











I am trying to read text from an image using Tesseract. The picture is high quality, so Tesseract reads the text with about 95% accuracy, which is OK for me at this point. However, Tesseract is unable to read text that is highlighted (selected). Please refer to the picture below.



enter image description here



How can I read the text in the selected area using Tesseract? Is there any way to identify which word is highlighted in the image?



The code used to read and convert the text from the image is below.



tesseract::TessBaseAPI *myOCR = new tesseract::TessBaseAPI();

// Init() returns 0 on success. Backslashes in the path must be escaped.
if (myOCR->Init("C:\\QTSoftware\\IODriver\\", "eng")) {
    SaveLineLog(brdInd, "Unable to initialize tesseract engine", __LINE__);
    return RC_TESSERACT_ENG_FAILURE;
}

SaveLineLog(brdInd, "tesseract engine is UP :)", __LINE__);

// Open the picture in binary mode; text mode ("r") can corrupt image data on Windows.
FILE *pFile = NULL;
if (fopen_s(&pFile, fileName, "rb") != 0 || pFile == NULL) {
    SaveLineLog(brdInd, "Unable to open image file", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}

PIX *pix = pixReadStreamBmp(pFile); // Image format from `leptonica`
fclose(pFile);

if (pix == NULL) {
    SaveLineLog(brdInd, "Pix failure", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}

myOCR->SetImage(pix);

// The caller owns the returned buffer and must release it with delete[].
char *outText = myOCR->GetUTF8Text();






opencv image-processing ocr tesseract






edited Nov 7 at 11:14









Amnon

asked Jun 6 '17 at 23:13









Pankaj Mishra

  • Can you attach the code you are using? Are you performing any kind of pre-processing before passing the image to Tesseract?
    – ZdaR
    Jun 7 '17 at 3:01










  • I used the code above in my application to invoke Tesseract and get the text from the image. All the images are black and white, so pre-processing was not required. Now that I have found that Tesseract ignores the highlighted text, I may try pre-processing.
    – Pankaj Mishra
    Jun 8 '17 at 18:44




















1 Answer
If all images have this format, you can first invert the colors, then apply threshold binarization to get rid of this shadow.
It is always better to pre-process and give the OCR engine a clean image; this will probably improve the recognition rate too.
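The two steps suggested here can be sketched without any imaging library. This is only an illustration, assuming the picture has already been decoded into an 8-bit grayscale buffer; the helper name `invertAndThreshold` and the cut-off of 128 are hypothetical choices, not part of Tesseract or Leptonica.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: invert an 8-bit grayscale image in place, then
// binarize it so every pixel becomes pure black (0) or pure white (255).
// `threshold` is an assumed cut-off; tune it for your own images.
void invertAndThreshold(std::vector<std::uint8_t>& gray,
                        std::uint8_t threshold = 128) {
    for (std::uint8_t& px : gray) {
        // Step 1: invert the pixel (light becomes dark and vice versa).
        const std::uint8_t inverted = static_cast<std::uint8_t>(255 - px);
        // Step 2: binarize the inverted value against the cut-off.
        px = (inverted >= threshold) ? 255 : 0;
    }
}
```

If you stay with Leptonica, `pixInvert` and `pixThresholdToBinary` look like the natural counterparts, but check their exact signatures and input depth requirements in the Leptonica headers before relying on them.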







answered Jun 8 '17 at 2:23









MoustafaS

  • I understand that I may need to pre-process the image before passing it to Tesseract. How do I identify the highlighted text in the image? Thanks.
    – Pankaj Mishra
    Jun 8 '17 at 18:38










  • @PankajMishra What I suggested fixes the issue of not being able to 'read' the highlighted part, but identifying it would need some more info.
    – MoustafaS
    Jun 8 '17 at 23:15
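One possible way to locate the highlighted area, offered as a rough heuristic rather than a confirmed solution: in a black-and-white screenshot, a selected word is usually a mostly dark rectangle with light glyphs, while normal text is mostly light. The sketch below tiles a grayscale buffer into small blocks and returns the bounding box of the blocks that are more than half dark. The `Box` type, the function name, the block size of 8, and the darkness cut-off of 128 are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Axis-aligned box in pixel coordinates (hypothetical type for this sketch).
struct Box {
    int left, top, width, height;
};

// Writes to `out` the bounding box of all blocks whose pixels are mostly
// dark and returns true, or returns false when no such block exists.
// `gray` holds width*height 8-bit pixels in row-major order. `block` and
// `darkBelow` are assumed tuning values, not library constants.
bool findHighlightedBox(const std::vector<std::uint8_t>& gray,
                        int width, int height, Box& out,
                        int block = 8, std::uint8_t darkBelow = 128) {
    if (width <= 0 || height <= 0 || block <= 0 ||
        gray.size() < static_cast<std::size_t>(width) *
                          static_cast<std::size_t>(height)) {
        return false;
    }
    int minX = width, minY = height, maxX = -1, maxY = -1;
    for (int by = 0; by < height; by += block) {
        for (int bx = 0; bx < width; bx += block) {
            const int yEnd = std::min(by + block, height);
            const int xEnd = std::min(bx + block, width);
            int dark = 0;
            for (int y = by; y < yEnd; ++y) {
                for (int x = bx; x < xEnd; ++x) {
                    const std::size_t idx =
                        static_cast<std::size_t>(y) * width + x;
                    if (gray[idx] < darkBelow) ++dark;
                }
            }
            const int total = (yEnd - by) * (xEnd - bx);
            if (2 * dark > total) {  // more than half of the block is dark
                minX = std::min(minX, bx);
                minY = std::min(minY, by);
                maxX = std::max(maxX, xEnd - 1);
                maxY = std::max(maxY, yEnd - 1);
            }
        }
    }
    if (maxX < 0) return false;
    out = Box{minX, minY, maxX - minX + 1, maxY - minY + 1};
    return true;
}
```

Once a box is found, you could invert only the pixels inside it so the whole image is dark text on a light background, or restrict recognition to that area with `TessBaseAPI::SetRectangle(left, top, width, height)` to learn which word is selected. Large dark graphics would also trigger this heuristic, so treat the result as a candidate region.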













