Scraping: issues receiving table data when looping with request

UPDATE:
The code only works part of the time. There are 2,000+ cryptos, and at the moment I have 492 unique files with their histories.

When I run one of the skipped URLs on its own, it works. So I think the problem has been narrowed down to something about how the content is requested.

Is it possible to make sure the table I'm interested in is fully loaded before the code continues?



UPDATE:
I got it working properly. I think the site I'm scraping limits how many requests you can make per second or minute.
I put a 3-second delay between every request, and now it works!
Thanks to both of you for the help. Even though neither answer was a direct fix, they put me on the right track to figuring it out.



from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

def scraping(url):
    global line
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        # No historical-data table on this page; log the URL and skip it.
        print(url)
        return
    # One list per <tr>, one string per <td>.
    data = [[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')]
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True)  # drop the header row
    df[0] = pd.to_datetime(df[0])
    for i in range(1, 7):
        # Remove thousands separators and "-" placeholders before parsing numbers.
        df[i] = pd.to_numeric(df[i].str.replace(",", "").str.replace("-", ""))
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    return df.to_csv(line + '_historical_data.csv')


with open("list_of_urls.txt") as file:
    for line in file:
        time.sleep(3)  # throttle: one request every 3 seconds
        line = line.strip()
        start = "https://coinmarketcap.com/currencies/"
        end = "/historical-data/?start=20000101&end=21000101"
        url = start + line + end
        scraping(url)
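
If the fixed 3-second sleep ever proves insufficient, a slightly more robust pattern is to retry a failed request with an increasing delay. This is only a sketch under the assumption (per the update above) that the skips come from rate limiting; fetch_with_backoff is a hypothetical helper, not part of the original script.

import time
import requests

# Sketch: retry with exponential backoff instead of relying on one fixed delay.
# HTTP 429 ("Too Many Requests") is what a rate limiter would typically return.
def fetch_with_backoff(url, retries=4, base_delay=3):
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.content
        time.sleep(base_delay * (2 ** attempt))  # wait 3s, 6s, 12s, 24s
    return None  # give up after the final retry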









python-3.x pandas beautifulsoup

asked Nov 18 '18 at 0:57 by Kenneth, edited Nov 18 '18 at 17:49

          2 Answers

It could be that the URL returns a 404 (not found), or that the page has no table. To debug, print the name of the crypto currently being processed in your loop:



table = soup.find('table', {'class': 'table'})
if not table:
    print('no table')
    return
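
A sketch of that debugging idea, assuming the URL pattern from the question; debug_fetch and the status-code check are additions for illustration (a rate limiter would typically answer with HTTP 429 or 403 rather than a normal page):

import requests
from bs4 import BeautifulSoup

# Sketch: report which crypto was skipped and what HTTP status came back.
def debug_fetch(name):
    url = ("https://coinmarketcap.com/currencies/" + name +
           "/historical-data/?start=20000101&end=21000101")
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        print('no table for', name, '- HTTP', response.status_code)
    return table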





answered Nov 18 '18 at 2:31 by ewwink

• Hi, thank you so much. It works better now (I got 207 files this time). Do you have any idea why it can't find the table in the skipped ones? When I check the website, the table is there. Maybe even a solution to get the remaining 1,500+ cryptos' history?

            – Kenneth
            Nov 18 '18 at 14:50













• If you want to inspect the differences (I don't think there are any), this got skipped: coinmarketcap.com/currencies/commerceblock/historical-data/… and this worked: coinmarketcap.com/currencies/bitcoin/historical-data/…

            – Kenneth
            Nov 18 '18 at 15:03













• On my machine both return the table; I think you need a sleep between requests.

            – ewwink
            Nov 19 '18 at 5:22



















You may call findChildren() only if the returned table and tr objects are not None, as follows:



data = [[td.text.strip() for td in tr.findChildren('td') if td] for tr in table.findChildren('tr') if tr] if table else []
if len(data) > 0:
    # process your data here


          Hope it helps.
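
To connect this to the comment thread below: the IndexError mentioned there goes away if the DataFrame step runs only when some rows were actually parsed. A minimal sketch of that guard (parse_rows is a hypothetical helper that wraps the checks above):

from bs4 import BeautifulSoup
import pandas as pd

# Sketch: return [] instead of failing when the table or its rows are missing.
def parse_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        return []
    rows = [[td.text.strip() for td in tr.findChildren('td')]
            for tr in table.findChildren('tr')]
    return [row for row in rows if row]  # header <tr> has no <td> cells, so it drops out

data = parse_rows("<html></html>")  # substitute the real page content here
if data:
    df = pd.DataFrame(data)  # safe: df.index[0] now exists
else:
    print('no usable rows')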






edited Nov 18 '18 at 17:23; answered Nov 18 '18 at 1:06 by TeeKea

• Unfortunately not. Just tried it out, same error. I got through the first 38 before I hit the error. It seems so odd to me.

            – Kenneth
            Nov 18 '18 at 1:25













• What about now (see my updated answer)? I added if table and if td. It should hopefully work.

            – TeeKea
            Nov 18 '18 at 1:55











          • Hi, now it gives me this instead: "IndexError: index 0 is out of bounds for axis 0 with size 0" for this line "df.drop(df.index[0], inplace=True)"

            – Kenneth
            Nov 18 '18 at 14:44











          • Because you would need to skip processing if there are no td cells. Look at my updated answer. Hope it works now.

            – TeeKea
            Nov 18 '18 at 17:25










