BeautifulSoup: Get text, create dictionary























I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:



import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)


This produces the following for the first of many publications:




2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson




I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:



Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
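
Note that a single flat dictionary like the one above describes only one paper. Since the collection will eventually hold many publications, a common shape (just an illustrative sketch; the key names simply mirror the desired output above) is a list of such dictionaries, one per scraped item:

# Illustrative only: one dict per publication, collected in a list
papers = [
    {'Date': '2018-070',
     'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
     'Author/s': 'Gary S. Anderson'},
    # ... append one dict per <li> as you loop
]

Both answers below build a structure along these lines.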









python web-scraping beautifulsoup






asked Nov 7 at 10:23 by Barton, edited Nov 7 at 11:20 by ewwink












  • Don't use "code snippet" for Python code. See the editing help that is available while entering your question.
    – usr2564301
    Nov 7 at 11:10






























2 Answers



























I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use a for loop as well.



import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    # keep only the bare text nodes (NavigableStrings) inside each <li>
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})

print(papers[1])

{'Date': '2018-069',
 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
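
If you would rather end up with one dictionary keyed by the report number, as in the question, rather than a list, a small sketch building on the papers list above (papers_by_id is just an illustrative name):

# Index the scraped records by their report number, e.g. papers_by_id['2018-069']
papers_by_id = {p['Date']: {'Title': p['Title'], 'Author': p['Author']} for p in papers}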





answered Nov 7 at 11:24 by Gregor, edited Nov 7 at 16:01 (accepted answer)











































You could use a regex to match each part of the string.





• [-\d]+ matches the date (the string contains only digits and -)
• (?<=\s).*?(?=by) matches the title: it starts after a blank and ends just before "by" (which begins the author part)
• (?<=by\s).* matches the author(s), the rest of the string


    Full code



import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
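
As a quick sanity check, a sketch run against the first publication from the question (joined onto one line) shows how the three patterns split the flattened text:

import re

sample = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
          "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")

print(re.findall(r"[-\d]+", sample)[0])             # 2018-070
print(re.findall(r"(?<=\s).*?(?=by)", sample)[0])   # the title, up to 'by'
print(re.findall(r"(?<=by\s).*", sample)[0])        # Gary S. Anderson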





answered Nov 8 at 5:54 by kcorlidy




















