Headers are not getting extracted from PDF while extracting the table data from PDF using camelot
I am using Camelot to extract table data from a PDF, but the table headers are not being extracted.
The target PDF is linked below; the tables I need to extract are on pages 3 and 4.
https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing
One of the tables looks like the one below.
I have read the Camelot documentation, and I think the problem is related to "Detect short lines":
https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines
However, I have not been able to resolve the problem by tweaking the line_size_scaling parameter.
Please assist.
pdf-scraping python-camelot
edited Nov 9 at 15:43
Arpit Solanki
5,14921643
asked Nov 8 at 8:20
Abhishek Bisht
125
1 Answer
accepted
I plotted the detected table boundary on page 3 using
$ camelot -p 3 lattice -plot contour 007.pdf
It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). I then tried the table_areas keyword argument with flavor='lattice', but it didn't include the lines in the specified table boundary [bug 2]. I've filed both on the issue tracker as #200 and #201.
You can still use the table_areas keyword argument with flavor='stream' to get the table out.
Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf
Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging
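For reference, here is a minimal sketch of the stream workaround as a small script. Camelot expects each table_areas entry as a string "x1,y1,x2,y2", where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page). The helper names format_table_area and extract_table are hypothetical and only for illustration; the file name and coordinates are the ones used above and will differ for other PDFs.

```python
def format_table_area(left, top, right, bottom):
    """Build a Camelot table_areas string "x1,y1,x2,y2".

    (left, top) is the top-left corner and (right, bottom) the
    bottom-right corner, in PDF points with the origin at the
    bottom-left of the page, so top must be larger than bottom.
    """
    if right <= left:
        raise ValueError("right must be greater than left")
    if top <= bottom:
        raise ValueError("top must be greater than bottom in PDF coordinates")
    return ",".join(str(v) for v in (left, top, right, bottom))


def extract_table(pdf_path, page, area):
    """Read one table from one page with the stream flavor.

    Hypothetical wrapper: camelot is imported lazily so the helper
    above can be used without Camelot installed.
    """
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",
        table_areas=[area],
    )
    # Each extracted table exposes its contents as a pandas DataFrame.
    return tables[0].df
```

With Camelot installed, extract_table('007.pdf', 3, format_table_area(60, 770, 520, 400)) should return the page 3 table as a DataFrame, which you can then write out with its to_csv method. Validating the corner order up front helps because a swapped top and bottom is an easy mistake when reading coordinates off the plot.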
Hope that helps!
Hi @Vinayak, thanks for the response. I also plotted the table boundaries and got the same result: the headers are not part of the detected table. I will track the bug numbers.
– Abhishek Bisht
Nov 10 at 14:16
I am not able to create the "python-camelot" tag, as I don't have enough credits to do that. Could you please create it? Since I am using Camelot, I might have other questions as well.
– Abhishek Bisht
Nov 10 at 14:26
I can see the tag. If you found this answer helpful, please accept it, thanks.
– Vinayak Mehta
Nov 10 at 15:37
edited Nov 9 at 19:22
answered Nov 9 at 16:53
Vinayak Mehta
13110