Read highlighted text from a picture using tesseract

I am trying to read text from an image using tesseract. The picture is high quality, so tesseract reads the text with 95% accuracy, which is OK for me at this point. However, tesseract is unable to read the text that is highlighted (selected). Please refer to the picture:

enter image description here

How can I read the text from the selected area using tesseract? Is there any way to identify which word is highlighted in the image?

The code used to read and convert the text from the image is below.

tesseract::TessBaseAPI *myOCR = new tesseract::TessBaseAPI();

// Init() returns 0 on success; backslashes in the data path must be escaped.
if (myOCR->Init("C:\\QTSoftware\\IODriver\\", "eng")) {
    SaveLineLog(brdInd, "Unable to initialize tesseract engine", __LINE__);
    return RC_TESSERACT_ENG_FAILURE;
}

SaveLineLog(brdInd, "tesseract engine is UP :)", __LINE__);

//strcpy_s(fileName, "C:\\TEMP\\T481Logs\\FrameOCR23_0.jpg");

// Open the picture in binary mode; text mode ("r") can corrupt image data on Windows.
FILE *pFile = NULL;
if (fopen_s(&pFile, fileName, "rb") != 0 || pFile == NULL) {
    SaveLineLog(brdInd, "Pix failure", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}

// Image format from `leptonica`. pixReadStreamBmp() only reads BMP data;
// pixRead(fileName) detects the file format itself.
PIX *pix = pixReadStreamBmp(pFile);
fclose(pFile);

//Pix *pix = pixRead(fileName);

if (pix == NULL) {
    SaveLineLog(brdInd, "Pix failure", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}

myOCR->SetImage(pix);
char* outText = myOCR->GetUTF8Text();

edited Nov 7 at 11:14 by Amnon
asked Jun 6 '17 at 23:13 by Pankaj Mishra

Can you attach the code you are using? Are you performing any kind of pre-processing before passing the image to the tesseract method?
– ZdaR
Jun 7 '17 at 3:01

I have used the above code in my application to invoke 'tesseract' and get the text from the image. Since all the images are black and white, pre-processing was not required. Now I have found that tesseract is ignoring the highlighted text, so I may try to pre-process.
– Pankaj Mishra
Jun 8 '17 at 18:44

opencv image-processing ocr tesseract


1 Answer

If all images have this format, you can first invert the colors, then apply threshold binarization to get rid of this shadow.
It is always better to pre-process and give the OCR engine a clean image; it will probably improve the recognition rate too.
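To make the two steps concrete, here is a minimal library-free sketch of "invert, then threshold" on an 8-bit grayscale buffer. The function name `invertAndThreshold`, the raw buffer layout, and the fixed threshold are assumptions made for illustration; they are not part of Tesseract or Leptonica. With the Pix-based code from the question, Leptonica's own `pixInvert` and `pixThresholdToBinary` are probably a more natural fit, so check their documentation for the exact semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Tesseract or Leptonica API).
// Inverts an 8-bit grayscale image stored row by row, then binarizes it:
// pixels whose inverted value is >= threshold become 255, all others 0.
std::vector<std::uint8_t> invertAndThreshold(const std::vector<std::uint8_t>& gray,
                                             std::size_t width,
                                             std::size_t height,
                                             std::uint8_t threshold) {
    // The buffer is assumed to hold exactly width * height pixels.
    assert(gray.size() == width * height);

    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        // Step 1: invert the pixel.
        const std::uint8_t inverted = static_cast<std::uint8_t>(255 - gray[i]);
        // Step 2: fixed-threshold binarization.
        out[i] = (inverted >= threshold) ? 255 : 0;
    }
    return out;
}
```

The threshold value usually needs tuning per image set; 128 is only a starting point.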

answered Jun 8 '17 at 2:23 by MoustafaS


I understand that I may need to pre-process before passing the image to tesseract. How do I identify the highlighted text in the image? Thanks.
– Pankaj Mishra
Jun 8 '17 at 18:38

@PankajMishra What I suggested fixes the issue of not being able to 'read' the highlighted part, but identifying it would need some more info.
– MoustafaS
Jun 8 '17 at 23:15
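One possible way to identify the highlighted word, offered only as a sketch: if a highlighted word is drawn as light text on a dark background, then most pixels inside its bounding box are dark, while for normal text most pixels are light. The helper below, `isBoxHighlighted`, is hypothetical, and so are the grayscale buffer layout and the `darkBelow` cutoff. The word boxes themselves would have to come from elsewhere, for example from the `BoundingBox()` call on the iterator returned by Tesseract's `GetIterator()` after OCR on the pre-processed image; that wiring is an assumption and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Tesseract or Leptonica API).
// Returns true when more than half of the pixels inside the box
// [left, right) x [top, bottom) are darker than darkBelow, which suggests
// that the box sits on a dark (highlighted) background.
bool isBoxHighlighted(const std::vector<std::uint8_t>& gray,
                      std::size_t width, std::size_t height,
                      std::size_t left, std::size_t top,
                      std::size_t right, std::size_t bottom,
                      std::uint8_t darkBelow) {
    // The buffer is assumed to hold width * height pixels, row by row,
    // and the box must be non-empty and lie inside the image.
    assert(gray.size() == width * height);
    assert(left < right && right <= width);
    assert(top < bottom && bottom <= height);

    std::size_t dark = 0;
    for (std::size_t y = top; y < bottom; ++y) {
        for (std::size_t x = left; x < right; ++x) {
            if (gray[y * width + x] < darkBelow) {
                ++dark;
            }
        }
    }
    const std::size_t total = (right - left) * (bottom - top);
    return dark * 2 > total;
}
```

Calling this once per word box on the original (not inverted) grayscale image would flag the boxes with a dark background. The majority rule is a heuristic and may misfire on very bold text or small boxes.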
