Headers are not extracted when extracting table data from a PDF using Camelot

I am using Camelot to extract table data from a PDF, but the table headers are not being extracted.

The target PDF is linked below. The tables I need to extract are on pages 3 and 4.

https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing

One of the tables looks like this:
[image: one of the target tables]

I have read the Camelot documentation, and I think the problem is related to the "Detect short lines" section:

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines

However, I have not been able to resolve the problem by tweaking the line_size_scaling parameter.

Please assist.

edited Nov 9 at 15:43

Arpit Solanki

5,14921643

asked Nov 8 at 8:20

Abhishek Bisht

125

pdf-scraping python-camelot

1 Answer

accepted

I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). I then tried the table_areas keyword argument with flavor='lattice', but Camelot didn't include the lines in the specified table boundary [bug 2]. I've filed these on the issue tracker as #200 and #201.

You can still use the table_areas keyword argument with flavor='stream' to get the table out.

Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf

Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
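
As a rough sketch of the API call above, the snippet below wraps it in two small functions. The names stream_kwargs and extract_table are hypothetical helpers introduced for illustration and are not part of Camelot. The sketch assumes Camelot is installed, that 007.pdf is in the working directory, and that the area is given as the left-top corner followed by the right-bottom corner in PDF coordinate space.

```python
from typing import Any, Dict, Sequence


def stream_kwargs(page: int, area: Sequence[float]) -> Dict[str, Any]:
    """Build read_pdf keyword arguments for one page and one table area.

    area is (x1, y1, x2, y2): left-top corner, then right-bottom corner,
    in PDF coordinate space (origin at the bottom-left of the page).
    """
    x1, y1, x2, y2 = area
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected left-top then right-bottom: x1 < x2, y1 > y2")
    return {
        "pages": str(page),
        "flavor": "stream",
        "table_areas": [",".join(str(v) for v in area)],
    }


def extract_table(pdf_path: str, page: int, area: Sequence[float]):
    """Return the first table found in the given area as a pandas DataFrame."""
    import camelot  # imported lazily so stream_kwargs works without Camelot

    tables = camelot.read_pdf(pdf_path, **stream_kwargs(page, area))
    if len(tables) == 0:
        raise LookupError("no table detected in the given area")
    return tables[0].df


# Example (needs Camelot installed and 007.pdf present):
# df = extract_table("007.pdf", 3, (60, 770, 520, 400))
# print(df.head())  # inspect the first rows to check the header is present
```

Validating the corner order up front is a design choice of this sketch: a swapped pair of corners is an easy mistake to make when reading coordinates off a plot.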

You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging
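
A minimal sketch of that workflow from Python follows. It assumes a Camelot release that provides camelot.plot (older releases exposed plotting differently) and that matplotlib is installed. The helpers corners_to_area and show_text_plot are hypothetical names, not Camelot functions. The idea is to plot the page text, read two opposite corners of the table off the axes, and turn them into the string that table_areas expects.

```python
from typing import Tuple

Point = Tuple[float, float]


def corners_to_area(p: Point, q: Point) -> str:
    """Turn any two opposite corners into an "x1,y1,x2,y2" area string.

    The left-top corner comes first and the right-bottom corner second,
    in PDF coordinates where y grows upward from the bottom of the page.
    """
    left, right = sorted((p[0], q[0]))
    bottom, top = sorted((p[1], q[1]))
    return f"{left},{top},{right},{bottom}"


def show_text_plot(pdf_path: str, page: int) -> None:
    """Plot the text found on a page so corner coordinates can be read off."""
    import camelot  # lazy imports: only needed when actually plotting
    import matplotlib.pyplot as plt

    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="stream")
    camelot.plot(tables[0], kind="text")  # axes are in PDF coordinate space
    plt.show()


# Example (needs Camelot, matplotlib and 007.pdf):
# show_text_plot("007.pdf", 3)
# area = corners_to_area((520, 400), (60, 770))
```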

Hope that helps!

[image: contour plot of the detected table boundary on page 3]

edited Nov 9 at 19:22

answered Nov 9 at 16:53

Vinayak Mehta

13110

Hi @Vinayak, thanks for the response. I also plotted the table boundaries and got the same result: the headers are not part of the detected table. I will track the bug numbers.
– Abhishek Bisht
Nov 10 at 14:16

I am not able to create the tag "python-camelot" because I don't have enough reputation. Could you please create it? Since I am using Camelot, I might have other questions as well.
– Abhishek Bisht
Nov 10 at 14:26

I can see the tag. If you found this answer helpful, please accept it, thanks.
– Vinayak Mehta
Nov 10 at 15:37
