Headers are not getting extracted from PDF while extracting the table data from PDF using camelot











I am using Camelot to extract table data from a PDF, but the table headers are not being extracted.



The target PDF is linked below. The tables I need to extract are on pages 3 and 4.



https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing



One of the tables looks like this:
enter image description here



I have read the Camelot documentation, and I think the problem is related to the "Detect short lines" section:



https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines



However, I have not been able to resolve the problem by tweaking the line_size_scaling parameter.



Please assist.










pdf-scraping python-camelot

asked Nov 8 at 8:20 by Abhishek Bisht (edited Nov 9 at 15:43 by Arpit Solanki)
























          1 Answer (accepted)










          I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). I then tried the table_areas keyword argument with flavor='lattice', but then it didn't include the lines in the specified table boundary [bug 2]. I've filed these on the issue tracker as #200 and #201.



          You can still use the table_areas keyword argument with flavor='stream' to get the table out.



          Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf



          Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
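
As a rough sketch, the API call above could be wrapped in a small function. The helper names parse_table_area and extract_stream_table are made up for illustration. The check assumes the area string follows Camelot's documented "x1,y1,x2,y2" form, with (x1, y1) as the top-left and (x2, y2) as the bottom-right corner in PDF coordinates (origin at the bottom-left of the page). Keyword names have changed between Camelot releases, so check the docs for the version you have installed.

```python
def parse_table_area(area):
    """Validate an "x1,y1,x2,y2" area string and return it as four floats.

    Assumes (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner in PDF coordinate space, where y grows upward from the page bottom.
    """
    parts = [float(p) for p in area.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers: x1,y1,x2,y2")
    x1, y1, x2, y2 = parts
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected (x1, y1) = top-left and (x2, y2) = bottom-right")
    return (x1, y1, x2, y2)


def extract_stream_table(pdf_path, page, area):
    """Read one page with flavor='stream', restricted to the given area.

    Returns a list of pandas DataFrames, one per table Camelot finds.
    """
    import camelot  # imported lazily so parse_table_area works without it

    parse_table_area(area)  # fail early on a malformed area string
    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",
        table_areas=[area],
    )
    return [table.df for table in tables]
```

Calling extract_stream_table('007.pdf', 3, '60,770,520,400') would mirror the API call above. With the stream flavor the header text should come back as the first row(s) of the DataFrame rather than as column names, so you may still need to promote it yourself.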



          You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging
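
A minimal sketch of that workflow from Python, under some assumptions: recent Camelot releases expose camelot.plot(table, kind='text') for this (older releases used a different entry point, so check your version), and the helper names show_text_plot and area_from_corners are hypothetical. The idea is to plot the page text, read off two opposite corners of the table from the axes, and turn them into an area string.

```python
def area_from_corners(corner_a, corner_b):
    """Turn two opposite corners (x, y) into an "x1,y1,x2,y2" area string.

    The corners may be given in any order. The result lists the top-left
    corner first, then the bottom-right one, in PDF coordinates (y grows
    upward from the bottom of the page).
    """
    (xa, ya), (xb, yb) = corner_a, corner_b
    left, right = sorted((xa, xb))
    bottom, top = sorted((ya, yb))
    if left == right or bottom == top:
        raise ValueError("corners must span a non-empty rectangle")
    return f"{left},{top},{right},{bottom}"


def show_text_plot(pdf_path, page):
    """Plot the text Camelot sees on a page so coordinates can be read off."""
    import camelot  # imported lazily so area_from_corners works without it
    import matplotlib.pyplot as plt

    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="stream")
    if len(tables) == 0:
        raise RuntimeError("no table found on this page, nothing to plot")
    camelot.plot(tables[0], kind="text")
    plt.show()
```

For the coordinates used in this answer, area_from_corners((60, 770), (520, 400)) gives the string passed to table_areas above.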



          Hope that helps!



          enter image description here






          answered Nov 9 at 16:53 by Vinayak Mehta (edited Nov 9 at 19:22)












          • Hi @Vinayak, thanks for the response. I also plotted the table boundaries and got the same result: the headers are not part of the detected table. I will track the bug numbers.
            – Abhishek Bisht
            Nov 10 at 14:16












          • I am not able to create the tag "python-camelot" because I don't have enough reputation. Could you please create it? I am using Camelot, so I may have other questions as well.
            – Abhishek Bisht
            Nov 10 at 14:26












          • I can see the tag. If you found this answer helpful, please accept it, thanks.
            – Vinayak Mehta
            Nov 10 at 15:37

















