BeautifulSoup: Get text, create dictionary
I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:
import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)
This produces the following for the first of many publications:
2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson
I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:
Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}
python web-scraping beautifulsoup
asked Nov 7 at 10:23 by Barton

Don't use "code snippet" for Python code. See the editing help that is available while entering your question. – usr2564301, Nov 7 at 11:10
2 Answers
Accepted answer (answered Nov 7 at 11:24 by Gregor, edited Nov 7 at 16:01):
I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use a for loop as well.
import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    # The text nodes of each <li> appear in order: date, title, "by", author(s)
    info = [desc.strip() for desc in paper.descendants
            if type(desc) is NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})
print(papers[1])
{'Date': '2018-069',
'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
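A related option, not used above: BeautifulSoup also exposes a stripped_strings generator that yields the same text nodes with the whitespace already stripped. Note that it skips whitespace-only nodes, so the indices may differ from the descendants version; this is a minimal sketch against hypothetical markup shaped like the question's output, not the live page.

from bs4 import BeautifulSoup

# Hypothetical <li> mirroring the shape of one result item; verify the
# indices against the real page before relying on them.
html = ('<li class="list-group-item downfree">2018-070 '
        '<a href="#">Reliably Computing Nonlinear Dynamic Stochastic Model '
        'Solutions: An Algorithm with Error Formulas</a>'
        'by <i>Gary S. Anderson</i></li>')

li = BeautifulSoup(html, 'html.parser').li
info = list(li.stripped_strings)  # ['2018-070', 'Reliably ...', 'by', 'Gary S. Anderson']
print({'Date': info[0], 'Title': info[1], 'Author': info[3]})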
Answer (answered Nov 8 at 5:54 by kcorlidy):
You could use a regex to match each part of the string.

[-\d]+
matches the date: this part of the string contains only digits and "-"

(?<=\s).*?(?=by)
matches the title: it starts after a blank and ends right before "by" (which begins the author part)

(?<=by\s).*
matches the author(s): the rest of the string
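To see what each pattern captures, here is a quick sketch that runs them against the sample line from the question:

import re

# Sample line as printed by the question's loop (note the missing space
# in "Formulasby": the "by" comes from a separate HTML element).
text = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
        "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")

print(re.findall(r"[-\d]+", text)[0])            # 2018-070
print(re.findall(r"(?<=\s).*?(?=by)", text)[0])  # Reliably Computing ... Error Formulas
print(re.findall(r"(?<=by\s).*", text)[0])       # Gary S. Anderson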
Full code
import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
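As a variation on this answer, the three patterns can be combined into a single expression with capture groups; this sketch assumes the same "date title by author(s)" layout as above:

import re

# One pattern, three capture groups: date, title, author(s).
PAPER_RE = re.compile(r"([-\d]+)\s+(.*?)by\s+(.*)")

text = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
        "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")

m = PAPER_RE.match(text)
if m:
    date, title, authors = m.groups()
    print({'date': date, 'Title': title, 'Author(s)': authors})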