Split PDF into separate files based on text found using regex





I have a PDF splitter built with ByteScout.PDFExtractor. My code searches for a unique identifying header of the form "TP###### SIGNED AFFIDAVIT", where each # can be any digit from 0 to 9. I'm using a regular expression to search for those headers, like this:



Dim regexPattern = "*TP[0-9]{6}* *SIGNED AFFIDAVIT*"



This is working. The problem is that it splits the document page by page, so after the split I get the following in my directory:



TP02433 SIGNED AFFIDAVIT 1
TP02433 SIGNED AFFIDAVIT 2
TP02433 SIGNED AFFIDAVIT 3
TP02354 SIGNED AFFIDAVIT 4
TP02354 SIGNED AFFIDAVIT 5
TP02354 SIGNED AFFIDAVIT 6 ...


My question is this: what could I change in my code so that when it finds, for example, TP02433, it keeps those pages together until it finds the next TP#?

Is there a way to find "TP[0-9]{6} SIGNED AFFIDAVIT", then extract all of the following pages and keep them together until the next unique "TP[0-9]{6} SIGNED AFFIDAVIT" is found? The end result should look like this:

TP02433 SIGNED AFFIDAVIT (1 - 3)
TP02354 SIGNED AFFIDAVIT (4 - 6)


Here's my working code so far:



Imports System.IO
Imports Bytescout.PDFExtractor
Imports Microsoft.Office.Interop
Imports System.IO.Path
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim unmerged = Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Tesspdf")
        Dim pdfFile As String = "G:WordDepartment FoldersPre-SuitXavierMPOP.pdf"

        Dim extractor As New TextExtractor()
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch
        extractor.LoadDocumentFromFile(pdfFile)

        Dim pageCount = extractor.GetPageCount()

        Dim currentPageTypeName = "UNKNOWN"
        Dim PageTypeName = "test"
        extractor.RegexSearch = True
        Dim regexPattern = "*TP[0-9]{6}* *SIGNED AFFIDAVIT*"

        For i = 0 To pageCount - 1

            ' Remember the most recent header; pages without a header reuse it
            If extractor.Find(i, regexPattern, False) Then
                ' Strip characters that are not safe in a file name
                PageTypeName = Regex.Replace(extractor.TextFound.Text, "[^A-Za-z0-9\-/#\s]", "")
                currentPageTypeName = PageTypeName
            End If

            Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True}

                Dim pageNumber = i + 1 ' (!) page number in ExtractPage() is 1-based

                If Not Directory.Exists(unmerged) Then
                    Directory.CreateDirectory(unmerged)
                End If

                ' One output file per page, which is the behavior I want to change
                Dim outputfile = Combine(unmerged, currentPageTypeName & " " & pageNumber & ".pdf")
                splitter.ExtractPage(pdfFile, outputfile, pageNumber)

            End Using
        Next

        extractor.Dispose()

    End Sub

End Module


I would use ExtractPageRange, but the number of pages per document varies. So I was wondering whether this code could find the first "*TP[0-9]{6}* *SIGNED AFFIDAVIT*", extract all the pages after that header until it reaches the next "*TP[0-9]{6}* *SIGNED AFFIDAVIT*", and repeat until the PDF document is completely split?
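
To make the grouping I am describing concrete, here is a minimal sketch of just the range-building step, with no ByteScout calls in it. It assumes the header text for each page has already been collected into a list (Nothing for a page with no header). The names PageGroup and GroupPages are hypothetical and used only for illustration. A new group starts whenever a header differs from the current one, so it should behave the same whether the header appears on every page or only on the first page of each affidavit.

```vbnet
Imports System
Imports System.Collections.Generic

Module PageGroupingSketch

    ' One output document: a label plus a 1-based, inclusive page range.
    ' (Hypothetical type, for illustration only.)
    Public Structure PageGroup
        Public Label As String
        Public FirstPage As Integer
        Public LastPage As Integer
    End Structure

    ' headers(i) is the header text found on 0-based page i,
    ' or Nothing when no header was found on that page.
    Public Function GroupPages(headers As IList(Of String)) As List(Of PageGroup)
        Dim groups As New List(Of PageGroup)()
        If headers.Count = 0 Then Return groups

        Dim label As String = If(headers(0), "UNKNOWN")
        Dim firstPage As Integer = 1

        For i As Integer = 1 To headers.Count - 1
            ' A header that differs from the current one closes the open group.
            If headers(i) IsNot Nothing AndAlso headers(i) <> label Then
                groups.Add(New PageGroup With {.Label = label, .FirstPage = firstPage, .LastPage = i})
                label = headers(i)
                firstPage = i + 1
            End If
        Next

        ' Close the final group at the last page of the document.
        groups.Add(New PageGroup With {.Label = label, .FirstPage = firstPage, .LastPage = headers.Count})
        Return groups
    End Function

    Sub Main()
        Dim headers = New String() {
            "TP02433 SIGNED AFFIDAVIT", Nothing, Nothing,
            "TP02354 SIGNED AFFIDAVIT", Nothing, Nothing}

        For Each g In GroupPages(headers)
            Console.WriteLine($"{g.Label} ({g.FirstPage} - {g.LastPage})")
        Next
    End Sub

End Module
```

Each resulting range could then be handed to a range-extraction call such as ExtractPageRange in place of the per-page ExtractPage call. I have not shown that call because I am not sure of its exact signature. One limitation of this sketch: two consecutive affidavits with the same TP number would end up in one file.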










      regex vb.net pdf split
















asked Nov 23 '18 at 20:19 by Pr0x1mo























