Read highlighted text from a picture using tesseract
I am trying to read text from an image using tesseract. The picture is high quality, so tesseract is able to read the text with 95% accuracy, which is OK for me at this point. However, tesseract is unable to read text that is highlighted (selected); please refer to the picture.
How can I read the text from the selected area using tesseract? Is there any way to identify which word is highlighted in the image?
The code used to read and convert the text from the image is below.
tesseract::TessBaseAPI *myOCR = new tesseract::TessBaseAPI();
// Init() returns 0 on success; backslashes in the data path must be escaped.
if (myOCR->Init("C:\\QTSoftware\\IODriver\\", "eng")) {
    SaveLineLog(brdInd, "Unable to initialize tesseract engine", __LINE__);
    return RC_TESSERACT_ENG_FAILURE;
}
SaveLineLog(brdInd, "tesseract engine is UP :)", __LINE__);
// fileName is e.g. "C:\\TEMP\\T481Logs\\FrameOCR23_0.jpg"
FILE *pFile;
// Open the picture in binary mode ("rb"); text mode corrupts image data on Windows.
if (fopen_s(&pFile, fileName, "rb") != 0) {
    SaveLineLog(brdInd, "Pix failure", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}
PIX *pix = pixReadStreamBmp(pFile); // Image format from `leptonica`
fclose(pFile);
// Alternative: Pix *pix = pixRead(fileName);
if (pix == NULL)
{
    SaveLineLog(brdInd, "Pix failure", __LINE__);
    return RC_TESSERACT_PIX_FAILURE;
}
myOCR->SetImage(pix);
char *outText = myOCR->GetUTF8Text();
opencv image-processing ocr tesseract
edited Nov 7 at 11:14
Amnon
asked Jun 6 '17 at 23:13
Pankaj Mishra
Can you attach the code you are using? Are you performing any kind of pre-processing before passing the image to the tesseract method?
– ZdaR
Jun 7 '17 at 3:01
I have used the above code in my application to invoke tesseract and get the text from the image. Since all the images are black and white, pre-processing was not required. Now I have found that tesseract is ignoring the highlighted text, so I may try to pre-process.
– Pankaj Mishra
Jun 8 '17 at 18:44
1 Answer
If all images have this format, you can first invert the colors, then use threshold binarization to get rid of this shadow.
It is always better to pre-process and provide a clean image to the OCR engine; it will probably improve the detection rate too.
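As a rough illustration of the two steps, here is a minimal sketch that works on a plain 8-bit grayscale buffer. The `GrayImage` struct and the `invertColors` / `thresholdBinary` helpers are hypothetical names made up for this example; they are not part of tesseract, leptonica or OpenCV, and the cutoff value would need tuning for your images.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for an 8-bit grayscale image
// (an assumption for this sketch, not a leptonica or OpenCV type).
struct GrayImage {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;  // row-major, 0 = black, 255 = white
};

// Step 1: invert every pixel, so light text on a dark highlight
// becomes dark text on a light background.
GrayImage invertColors(const GrayImage& src) {
    assert(src.pixels.size() == src.width * src.height);
    GrayImage out = src;
    for (auto& p : out.pixels) {
        p = static_cast<std::uint8_t>(255 - p);
    }
    return out;
}

// Step 2: global threshold binarization. Pixels at or above the cutoff
// become white (255), everything else becomes black (0).
GrayImage thresholdBinary(const GrayImage& src, std::uint8_t cutoff) {
    assert(src.pixels.size() == src.width * src.height);
    GrayImage out = src;
    for (auto& p : out.pixels) {
        p = (p >= cutoff) ? 255 : 0;
    }
    return out;
}
```

You would call `thresholdBinary(invertColors(img), cutoff)` and then hand the resulting buffer to the OCR engine. Note that inverting the whole picture also inverts the non-highlighted text, so it may work better to apply this only to the highlighted region.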
answered Jun 8 '17 at 2:23
MoustafaS
I understand that I may need to pre-process before passing the image to tesseract. How do I identify the highlighted text in the image? Thanks.
– Pankaj Mishra
Jun 8 '17 at 18:38
@PankajMishra What I suggested fixes the issue of not being able to 'read' the highlighted part, but identifying it would need some more info.
– MoustafaS
Jun 8 '17 at 23:15
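For what it's worth, one possible way to locate a highlighted line is sketched below. It rests on assumptions that are not confirmed anywhere in this thread: that the image is grayscale with dark text on a light background, and that the highlight is a dark band spanning most of a row. `RowBand` and `findDarkBands` are hypothetical names for this example only, and both parameters would need tuning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive range of image rows that look highlighted (hypothetical type).
struct RowBand {
    std::size_t first;
    std::size_t last;
};

// Marks a row as "dark" when at least minDarkFraction of its pixels are
// below darkCutoff, then groups consecutive dark rows into bands.
// Assumes a row-major 8-bit grayscale buffer (0 = black, 255 = white).
std::vector<RowBand> findDarkBands(const std::vector<std::uint8_t>& pixels,
                                   std::size_t width, std::size_t height,
                                   std::uint8_t darkCutoff,
                                   double minDarkFraction) {
    assert(pixels.size() == width * height);
    std::vector<RowBand> bands;
    bool inBand = false;
    for (std::size_t y = 0; y < height; ++y) {
        std::size_t dark = 0;
        for (std::size_t x = 0; x < width; ++x) {
            if (pixels[y * width + x] < darkCutoff) ++dark;
        }
        const bool isDark =
            width > 0 && static_cast<double>(dark) / width >= minDarkFraction;
        if (isDark) {
            if (inBand) {
                bands.back().last = y;      // extend the current band
            } else {
                bands.push_back({y, y});    // start a new band
                inBand = true;
            }
        } else {
            inBand = false;
        }
    }
    return bands;
}
```

Each returned band could then be cropped, inverted and passed to the OCR engine separately, which would also tell you which text was highlighted. If only a single word is highlighted rather than a whole line, the same counting could be repeated over columns inside the band.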