Scraping: issues receiving table data when looping with request
UPDATE:
The code only works intermittently. There are 2000+ cryptos, and at the moment I have 492 unique files with their history.
When I run a URL that was skipped in the loop on its own, it works. So I think the problem has been narrowed down to something in the content request itself.
Is it possible to make sure the table I'm interested in is fully loaded before the code continues?
UPDATE:
I got it working properly. I think there is a limit on how many requests you can make per second or minute on the website I'm scraping from.
I put a 3-second delay between every request and NOW IT WORKS!!!!
Thanks to both of you for the help. Even though neither answer was a direct fix, they put me on the right track to figuring it out.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

def scraping(url):
    global line
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        # Table not found on this page; log the URL and skip it
        print(url)
        return
    # Collect the cell text from every row of the historical-data table
    data = [[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')]
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True)  # drop the header row
    df[0] = pd.to_datetime(df[0])
    for i in range(1, 7):
        df[i] = pd.to_numeric(df[i].str.replace(",", "").str.replace("-", ""))
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    return df.to_csv(line + '_historical_data.csv')

with open("list_of_urls.txt") as file:
    for line in file:
        time.sleep(3)  # throttle requests to stay under the site's rate limit
        line = line.strip()
        start = "https://coinmarketcap.com/currencies/"
        end = "/historical-data/?start=20000101&end=21000101"
        url = start + line + end
        scraping(url)
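Since the loop may be re-run after partial failures, a minimal sketch of one small addition (not in the original code) is to skip cryptos whose CSV already exists, so a re-run only fetches the missing ones; it assumes the same scraping() function and time import shown above:

import os

with open("list_of_urls.txt") as file:
    for line in file:
        line = line.strip()
        # Illustrative addition: skip cryptos already downloaded on a previous run
        if os.path.exists(line + '_historical_data.csv'):
            continue
        time.sleep(3)  # keep the delay that avoids the site's rate limit
        url = "https://coinmarketcap.com/currencies/" + line + "/historical-data/?start=20000101&end=21000101"
        scraping(url)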
python-3.x pandas beautifulsoup
edited Nov 18 '18 at 17:49 by Kenneth
asked Nov 18 '18 at 0:57 by Kenneth
2 Answers
It could be that the URL returns a 404 (not found), or that the page has no table. To debug, print the name of the crypto currently being processed in your loop:

table = soup.find('table', {'class': 'table'})
if not table:
    print('no table')
    return

answered Nov 18 '18 at 2:31 by ewwink
Hi, thank you so much. It works better now (I got 207 files this time). Do you have any idea why it can't find the table in the skipped ones? If I check the website, the table is there. Maybe even a solution to get the remaining 1500+ cryptos' history?
– Kenneth
Nov 18 '18 at 14:50
If you want to inspect the differences (I don't think there are any), this got skipped: coinmarketcap.com/currencies/commerceblock/historical-data/… and this worked: coinmarketcap.com/currencies/bitcoin/historical-data/…
– Kenneth
Nov 18 '18 at 15:03
On my machine both return the table; I think you need a sleep between requests.
– ewwink
Nov 19 '18 at 5:22
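A minimal sketch of how the two suggestions from this thread (checking the response status and pausing/retrying when the table is missing) could be combined; the fetch_table helper, retry count, and delay are illustrative and not from the original posts:

import time
import requests
from bs4 import BeautifulSoup

def fetch_table(url, retries=3, delay=3):
    # Illustrative helper: retry a few times with a pause, which also
    # helps if the site rate-limits rapid requests.
    for _ in range(retries):
        response = requests.get(url)
        if response.status_code != 200:
            print(url, 'returned HTTP', response.status_code)
        else:
            soup = BeautifulSoup(response.content, 'html.parser')
            table = soup.find('table', {'class': 'table'})
            if table:
                return table
        time.sleep(delay)
    return None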
You may perform findChildren() only if the returned table and tr objects are not NoneType, as follows:

data = [[td.text.strip() for td in tr.findChildren('td') if td] for tr in table.findChildren('tr') if tr] if table else []
if len(data) > 0:
    # process your data here

Hope it helps.

edited Nov 18 '18 at 17:23
answered Nov 18 '18 at 1:06 by TeeKea
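A hedged sketch of how this guard might slot into the question's scraping() function so that pages without usable rows are skipped instead of failing on an empty DataFrame; the surrounding structure is assumed from the code in the question, and only the guard is new here:

import requests
from bs4 import BeautifulSoup

def scraping(url):
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    # Build the row data only when the table exists; otherwise fall back to an empty list
    data = ([[td.text.strip() for td in tr.findChildren('td') if td]
             for tr in table.findChildren('tr') if tr]
            if table else [])
    if len(data) == 0:
        print('no usable rows for', url)
        return None
    # ... build the DataFrame and write the CSV as in the question ...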
Unfortunately not. Just tried it out. Same error. I got through the first 38 before I hit the error. It seems so odd to me.
– Kenneth
Nov 18 '18 at 1:25
What about now (see my updated answer)? By adding if table and if td. It should hopefully work.
– TeeKea
Nov 18 '18 at 1:55
Hi, now it gives me this instead: "IndexError: index 0 is out of bounds for axis 0 with size 0" on the line "df.drop(df.index[0], inplace=True)".
– Kenneth
Nov 18 '18 at 14:44
That's because you need to skip processing if there are no td cells. Look at my updated answer. Hope it works now.
– TeeKea
Nov 18 '18 at 17:25