python requests & beautifulsoup bot detection



























I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using an ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:



from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)


But the output doesn't show the entire HTML of the page, so I can't continue with extracting the product details.
Any help on this?



EDIT 1:



From the given answer, it turns out I'm getting the markup of the bot-detection page. I researched a bit and found two ways to get past it:




  1. I might need to add a header to the request, but I couldn't figure out what the value of the header should be.

  2. Use Selenium.
     Now my question is: do both approaches work equally well?










python html web-scraping beautifulsoup python-requests

asked Aug 29 '18 at 3:09 by Proteeti Prova, edited Aug 29 '18 at 6:54




















  • It'd be nice if you could tell us what B004CNH98C is supposed to be, so people can look at the actual page. My guess is that some of the HTML is hidden behind JavaScript functions. You should load the page in Selenium and click it.

    – Joseph Seung Jae Dollar, Aug 29 '18 at 3:37











  • Meanwhile, I just got acquainted with Selenium WebDriver. Is a new Chrome window going to open every time I scrape a page?

    – Proteeti Prova, Aug 29 '18 at 3:50






  • Use headless options.

    – Joseph Seung Jae Dollar, Aug 29 '18 at 3:52






  • Pages that use JavaScript frameworks cannot be scraped with BS. And why scrape when Amazon has such a nice API?

    – e4c5, Aug 29 '18 at 4:02











  • I don't think the Amazon API is supported in my country.

    – Proteeti Prova, Aug 29 '18 at 4:16
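
Following up on the "use headless options" suggestion in the comments above, here is a minimal sketch (not part of the original thread) of how one headless Chrome instance could be reused for several pages, so no new browser window opens for each ASIN. It assumes Selenium and a matching chromedriver are installed; the ASIN list is only an illustration.

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # one browser instance for all pages

for asin in ["B004CNH98C"]:                 # illustrative ASIN list
    driver.get("http://www.amazon.com/dp/" + asin)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(asin, soup.title)

driver.quit()                               # close the browser when done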
















3 Answers














As some of the comments already suggested, if you need to interact with JavaScript on a page, it is better to use Selenium. However, regarding your first approach of adding a header:



import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")


These headers are a bit old, but they should still work. By sending them you are pretending that your request comes from a normal web browser. If you use requests without such a header, your code is basically telling the server that the request comes from Python, which most servers reject right away.



Another alternative for you could be fake-useragent; maybe you can give that a try as well.
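
For a quick sanity check, a small follow-up sketch could continue from the response and soup built above and look for the product title; the id "productTitle" is an assumption about Amazon's product-page markup and may need adjusting:

# Continues from the `response` and `soup` objects created in the snippet above.
# Assumption: the real product page contains an element with id="productTitle",
# while the bot-detection page does not.
title = soup.find(id="productTitle")
if title is not None:
    print("Product title:", title.get_text(strip=True))
else:
    print("Looks like the bot-detection page; got only", len(response.text), "characters")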






– answered Aug 29 '18 at 6:36 by WurzelseppQX



















  • I was confused about whether 'User-Agent' takes any predefined format to give my machine information. I came across this: developers.whatismybrowser.com/useragents/explore/…. I guess this will be the header I pass, am I correct?

    – Proteeti Prova, Aug 29 '18 at 6:46













  • Also, the docs say that custom-made headers are given less precedence. Does "less precedence" mean in terms of accepting the requests?

    – Proteeti Prova, Aug 29 '18 at 6:49






  • From the list of browsers you posted you can select the header you want to use; your request then pretends to come from that browser. I haven't found the passage about "less precedence", so I can only assume what is meant, but in general servers reject requests that look automated in order to keep performance good. This is why it is necessary to pretend to be a real browser so that the server accepts your request.

    – WurzelseppQX, Aug 29 '18 at 7:26






  • However, these days most websites provide APIs for people who want to make automated requests. This is actually good for both parties: API requests are better for server performance, and on your side less code is necessary and it is much more straightforward. So in general I recommend checking whether a page provides an API before trying to parse it the "hacky" way.

    – WurzelseppQX, Aug 29 '18 at 7:30

































try this:



import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
html = r.text

# option 1: print the raw HTML
# print(html)

soup = BeautifulSoup(html, "html.parser")

# option 2: print the parsed soup
print(soup)





– answered Aug 29 '18 at 5:35 by Bryro
























  • Already tried this way; it leads to the "make sure you are not a robot" page.

    – Proteeti Prova, Aug 29 '18 at 5:39

































It is easier to use fake_useragent here. It picks a random user agent based on real-world browser usage statistics. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.



import requests
from fake_useragent import UserAgent

# pick a random real-world User-Agent string
ua = UserAgent()
hdr = {'User-Agent': ua.random,
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print(response.content)


Selenium is used for browser automation and higher-level web scraping of dynamic content.
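
For multiple pages, a small sketch of the same idea: refresh the random User-Agent on each request while reusing a single requests.Session (the ASIN list here is only an illustration):

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
session = requests.Session()   # reuse one connection pool for all requests

for asin in ["B004CNH98C"]:    # illustrative ASIN list
    # fresh random User-Agent for each request
    response = session.get("http://www.amazon.com/dp/" + asin,
                           headers={"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    print(asin, soup.title)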






– answered Nov 20 '18 at 16:00 by Mutasim Sadi






















