BeautifulSoup: Get text, create dictionary























I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:



import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)


This produces the following for the first of many publications:




2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson




I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:



Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
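
Note that a single flat dictionary like the one above describes only one paper. Since the collection will eventually hold many publications, a common shape (just an illustrative sketch; the key names simply mirror the desired output above) is a list of such dictionaries, one per scraped item:

# Illustrative only: one dict per publication, collected in a list
papers = [
    {'Date': '2018-070',
     'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
     'Author/s': 'Gary S. Anderson'},
    # ... append one dict per <li> as you loop
]

Both answers below build a structure along these lines.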









python web-scraping beautifulsoup






asked Nov 7 at 10:23 by Barton, edited Nov 7 at 11:20 by ewwink












  • Don't use "code snippet" for Python code. See the editing help that is available while entering your question.
    – usr2564301
    Nov 7 at 11:10






























2 Answers



























I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use a for loop as well.



import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    # keep only the bare text nodes (NavigableStrings) inside each <li>
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})

print(papers[1])

{'Date': '2018-069',
 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
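
If you would rather end up with one dictionary keyed by the report number, as in the question, rather than a list, a small sketch building on the papers list above (papers_by_id is just an illustrative name):

# Index the scraped records by their report number, e.g. papers_by_id['2018-069']
papers_by_id = {p['Date']: {'Title': p['Title'], 'Author': p['Author']} for p in papers}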





answered Nov 7 at 11:24 by Gregor, edited Nov 7 at 16:01 (accepted answer)











































You could use a regex to match each part of the string.





• [-\d]+ matches the date (the string contains only digits and -)
• (?<=\s).*?(?=by) matches the title: it starts after a blank and ends just before "by" (which begins the author part)
• (?<=by\s).* matches the author(s), the rest of the string


    Full code



import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
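
As a quick sanity check, a sketch run against the first publication from the question (joined onto one line) shows how the three patterns split the flattened text:

import re

sample = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
          "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")

print(re.findall(r"[-\d]+", sample)[0])             # 2018-070
print(re.findall(r"(?<=\s).*?(?=by)", sample)[0])   # the title, up to 'by'
print(re.findall(r"(?<=by\s).*", sample)[0])        # Gary S. Anderson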





answered Nov 8 at 5:54 by kcorlidy




















