Scraping: issues receiving table data when looping with request

UPDATE:
The code only works part of the time. There are 2,000+ cryptos, and at the moment I have 492 unique files with their histories.

When I run one of the skipped URLs on its own, it works. So I think the problem has been narrowed down to something about how the content is requested.

Is it possible to make sure the table I'm interested in is fully loaded before the code continues?



UPDATE:
I got it working properly. I think the site I'm scraping limits how many requests you can make per second or minute.
I put a 3-second delay between every request, and now it works!
Thanks to both of you for the help. Even though neither answer was a direct fix, they put me on the right track to figuring it out.



from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

def scraping(url):
    global line
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        # No historical-data table on this page; log the URL and skip it.
        print(url)
        return
    # One list per <tr>, one string per <td>.
    data = [[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')]
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True)  # drop the header row
    df[0] = pd.to_datetime(df[0])
    for i in range(1, 7):
        # Remove thousands separators and "-" placeholders before parsing numbers.
        df[i] = pd.to_numeric(df[i].str.replace(",", "").str.replace("-", ""))
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    return df.to_csv(line + '_historical_data.csv')


with open("list_of_urls.txt") as file:
    for line in file:
        time.sleep(3)  # throttle: one request every 3 seconds
        line = line.strip()
        start = "https://coinmarketcap.com/currencies/"
        end = "/historical-data/?start=20000101&end=21000101"
        url = start + line + end
        scraping(url)
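
If the fixed 3-second sleep ever proves insufficient, a slightly more robust pattern is to retry a failed request with an increasing delay. This is only a sketch under the assumption (per the update above) that the skips come from rate limiting; fetch_with_backoff is a hypothetical helper, not part of the original script.

import time
import requests

# Sketch: retry with exponential backoff instead of relying on one fixed delay.
# HTTP 429 ("Too Many Requests") is what a rate limiter would typically return.
def fetch_with_backoff(url, retries=4, base_delay=3):
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.content
        time.sleep(base_delay * (2 ** attempt))  # wait 3s, 6s, 12s, 24s
    return None  # give up after the final retry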









python-3.x pandas beautifulsoup

asked Nov 18 '18 at 0:57 by Kenneth, edited Nov 18 '18 at 17:49

          2 Answers

It could be that the URL returns a 404 (not found), or that the page has no table. To debug, print the name of the crypto currently being processed in your loop:



table = soup.find('table', {'class': 'table'})
if not table:
    print('no table')
    return
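
A sketch of that debugging idea, assuming the URL pattern from the question; debug_fetch and the status-code check are additions for illustration (a rate limiter would typically answer with HTTP 429 or 403 rather than a normal page):

import requests
from bs4 import BeautifulSoup

# Sketch: report which crypto was skipped and what HTTP status came back.
def debug_fetch(name):
    url = ("https://coinmarketcap.com/currencies/" + name +
           "/historical-data/?start=20000101&end=21000101")
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        print('no table for', name, '- HTTP', response.status_code)
    return table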





answered Nov 18 '18 at 2:31 by ewwink

• Hi, thank you so much. It works better now (I got 207 files this time). Do you have any idea why it can't find the table in the skipped ones? When I check the website, the table is there. Maybe even a solution to get the remaining 1,500+ cryptos' history?

            – Kenneth
            Nov 18 '18 at 14:50













• If you want to inspect the differences (I don't think there are any), this got skipped: coinmarketcap.com/currencies/commerceblock/historical-data/… and this worked: coinmarketcap.com/currencies/bitcoin/historical-data/…

            – Kenneth
            Nov 18 '18 at 15:03













• On my machine both return the table; I think you need a sleep between requests.

            – ewwink
            Nov 19 '18 at 5:22



















You may call findChildren() only if the returned table and tr objects are not None, as follows:



data = [[td.text.strip() for td in tr.findChildren('td') if td] for tr in table.findChildren('tr') if tr] if table else []
if len(data) > 0:
    # process your data here


          Hope it helps.
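
To connect this to the comment thread below: the IndexError mentioned there goes away if the DataFrame step runs only when some rows were actually parsed. A minimal sketch of that guard (parse_rows is a hypothetical helper that wraps the checks above):

from bs4 import BeautifulSoup
import pandas as pd

# Sketch: return [] instead of failing when the table or its rows are missing.
def parse_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        return []
    rows = [[td.text.strip() for td in tr.findChildren('td')]
            for tr in table.findChildren('tr')]
    return [row for row in rows if row]  # header <tr> has no <td> cells, so it drops out

data = parse_rows("<html></html>")  # substitute the real page content here
if data:
    df = pd.DataFrame(data)  # safe: df.index[0] now exists
else:
    print('no usable rows')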






edited Nov 18 '18 at 17:23; answered Nov 18 '18 at 1:06 by TeeKea

• Unfortunately not. Just tried it out, same error. I got through the first 38 before I hit the error. It seems so odd to me.

            – Kenneth
            Nov 18 '18 at 1:25













• What about now (see my updated answer)? I added if table and if td. It should hopefully work.

            – TeeKea
            Nov 18 '18 at 1:55











          • Hi, now it gives me this instead: "IndexError: index 0 is out of bounds for axis 0 with size 0" for this line "df.drop(df.index[0], inplace=True)"

            – Kenneth
            Nov 18 '18 at 14:44











          • Because you would need to skip processing if there are no td cells. Look at my updated answer. Hope it works now.

            – TeeKea
            Nov 18 '18 at 17:25










