Need to create a dataset on news using python











I need to build a dataset of news articles. I want to extract every article that has ever been posted on a given news website. I have written this code:



import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime

# One tuple per scraped article
records = []

def cnbc(base_url):
    r = requests.get(base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    # Trim the fixed-length text that surrounds the article body
    content = content[18:-458]
    Country = 'United States'
    website = 'https://www.cnbc.com/'
    comments = ''
    genre = 'Political'
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    records.append((Title, content, Country, website, comments, genre, date))

cnbc("https://www.cnbc.com/2018/11/02/here-are-the-three-things-pulling-down-the-stock-market-again.html")


But this only lets me extract a single article.

Can anyone tell me how I can extract all the news URLs starting from the root of the website?










  • Of course; you need to fetch the page at https://www.cnbc.com/ to get all the latest news.
    – ewwink
    Nov 10 at 12:45















python beautifulsoup dataset






asked Nov 10 at 9:19









Ahmed

2 Answers
This is a Python 3 script. It is not flawless, but I hope it can serve as a starting point for what you are trying to achieve. I am not sure whether the site you want to scrape allows this, so I have left the constants WEB_SITE_BASE_URL and WEB_SITE_REGION_URL empty. What you put there is your choice.



import requests
from bs4 import BeautifulSoup
from datetime import datetime

# e.g. "https://www.xxxx.com"
WEB_SITE_BASE_URL = ""
# e.g. "https://www.xxxx.com/?region=us"
WEB_SITE_REGION_URL = ""

def get_categories(web_site_base_url):
    # Read the category names from the navigation menu
    r = requests.get(web_site_base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    spans = soup.find_all(class_="nav-menu-buttonText")
    categories = [category.text for category in spans]
    return categories

def get_links(category_url):
    # Collect every link on a category page that points to a November 2018 article
    r = requests.get(category_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    filtered_links = list(set([k for k in links if '/2018/11/' in k]))
    return filtered_links

def news(link):
    # Scrape one article page into a tuple
    r = requests.get(link)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    content = content[18:-458]
    Country = 'United States'
    website = WEB_SITE_BASE_URL
    comments = ''
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    spans = soup.find_all(class_="header_title last breadcrumb")
    categories = [category.text for category in spans]
    genre = categories
    return (Title, content, Country, website, comments, genre, date)

categories = get_categories(WEB_SITE_REGION_URL)
list_of_link_lists = []
for category in categories:
    # Spaces in a category name are URL-encoded as %20
    list_of_link_lists.append(get_links(WEB_SITE_BASE_URL + "/" + category.replace(" ", "%20")))
flat_link_list = list(set([item for sublist in list_of_link_lists for item in sublist]))
articles_list = []
for link in flat_link_list:
    try:
        # Assumes the collected hrefs are relative to the site root
        articles_list.append(news(WEB_SITE_BASE_URL + link))
    except Exception as error:
        print("Something was wrong:", error)
        continue

print(articles_list)
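On the point about not knowing whether the site allows scraping: one common first check is the site's robots.txt file. The sketch below uses only the standard library's `urllib.robotparser`. The `is_allowed` helper, the sample rules, and the example.com URLs are made up for illustration; they are not taken from any real site, and robots.txt does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://www.example.com/2018/11/02/story.html"))
print(is_allowed(SAMPLE_ROBOTS, "https://www.example.com/private/page.html"))
```

Against a live site, `RobotFileParser` can also download the file itself via `set_url()` followed by `read()`, after which `can_fetch()` works the same way.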





    There is a rough way to extract part of the news, shown in the code below. First, collect every div whose class is headline with news_headline = soup.find_all("div", class_="headline"). Then check whether each element is what we want.



    new = []
    for div in news_headline:
        if not div.a:
            continue
        link = url + div.a.get("href")
        if div.a.text:
            # split() and join() collapse \t, \n and repeated spaces
            title = " ".join(div.a.text.split())
        else:
            title = " ".join(div.a.get("title").split())
        # Tuples are immutable, so build the pair in one step
        new.append((link, title))


    This is the full code, written as compactly as I could.



    import requests
    from bs4 import BeautifulSoup

    def index(url="https://www.cnbc.com/world/"):
        with requests.Session() as se:
            res = se.get(url)
            # The encoding is a property of the response, not of the session
            res.encoding = "UTF-8"
            text = res.text
        # The "lxml" parser requires the lxml package to be installed
        soup = BeautifulSoup(text, "lxml")
        news_headline = soup.find_all("div", class_="headline")
        news_ = [(url + div.a.get("href"), " ".join(div.a.text.split()) if div.a.text else " ".join(div.a.get("title").split())) for div in news_headline if div.a]
        print(news_)

    index()
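One caveat with building links as `url + div.a.get("href")`: when `url` is a section page and the href is site-relative or already absolute, plain concatenation can produce a malformed address. A minimal sketch of a safer way to resolve and filter links follows. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages. The `/YYYY/MM/DD/` pattern is an assumption based on the article URL in the question, and `LinkCollector`, `article_links`, and the sample HTML are hypothetical names and data.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed article URL shape: /YYYY/MM/DD/slug.html
ARTICLE_PATH = re.compile(r"/\d{4}/\d{2}/\d{2}/")

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin handles relative and absolute hrefs alike
            self.links.append(urljoin(self.base_url, href))

def article_links(html, base_url):
    """Return unique article URLs found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seen = set()
    result = []
    for link in collector.links:
        if ARTICLE_PATH.search(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result

# Made-up page fragment, for illustration only
SAMPLE_HTML = """
<div class="headline"><a href="/2018/11/02/first-story.html">First</a></div>
<div class="headline"><a href="https://www.example.com/2018/11/03/second-story.html">Second</a></div>
<div class="headline"><a href="/2018/11/02/first-story.html">First again</a></div>
<a href="/about/">About</a>
"""

print(article_links(SAMPLE_HTML, "https://www.example.com/world/"))
```

Matching on a date-shaped path rather than a fixed string such as '/2018/11/' keeps the filter from being tied to a single month.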





    • kcorlidy, it is a good technique, but it only gives results from 6 Nov 2018 to 9 Nov 2018. I want to extract all the news that has ever been posted on that news website.
      – Ahmed
      Nov 11 at 8:53










    • Yes, as I said, this extracts only part of the news; we need one or more URLs to do that, such as https://www.cnbc.com/us-news/ or https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed
      – kcorlidy
      Nov 11 at 9:05










    • Yes kcorlidy, you are right; that was the basic problem. I did this by iterating through the root link, but it only gave news from at most 2 days before the current time. By the way, thanks. Kindly tell me more if you find anything.
      – Ahmed
      Nov 12 at 10:40










    • @Ahmed you can extract the section URLs from the menu, then pass each URL to my index() and extract the news from each page.
      – kcorlidy
      Nov 12 at 11:28
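The suggestion in the last comment can be sketched roughly as follows, assuming index() is changed to return its list of (link, title) pairs instead of printing it. `build_dataset`, `write_csv`, and `fake_index` are hypothetical helpers, and the example.com data is made up so that the sketch runs without network access.

```python
import csv
import io

def build_dataset(section_urls, index):
    """Call index(url) for each section and merge the results by link."""
    seen = set()
    rows = []
    for section_url in section_urls:
        for link, title in index(section_url):
            # The same story can appear in several sections
            if link not in seen:
                seen.add(link)
                rows.append((link, title))
    return rows

def write_csv(rows, handle):
    """Write the rows to an open text file with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["url", "title"])
    writer.writerows(rows)

def fake_index(url):
    """Stand-in for a real index(); returns canned (link, title) pairs."""
    pages = {
        "https://www.example.com/us-news/": [
            ("https://www.example.com/2018/11/09/a.html", "Story A"),
            ("https://www.example.com/2018/11/09/b.html", "Story B"),
        ],
        "https://www.example.com/pre-markets/": [
            ("https://www.example.com/2018/11/09/b.html", "Story B"),
            ("https://www.example.com/2018/11/08/c.html", "Story C"),
        ],
    }
    return pages.get(url, [])

sections = [
    "https://www.example.com/us-news/",
    "https://www.example.com/pre-markets/",
]
rows = build_dataset(sections, fake_index)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="", encoding="utf-8")` avoids blank lines between rows on Windows. As noted in the comments, section pages only list recent stories, so this still collects only what those pages currently show.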



































        answered Nov 10 at 20:57









        jaskowitchious

























            answered Nov 11 at 4:36









            kcorlidy













            • kcorlidy it is a good technique but it is giving the results from 6/Nov/2018 to 9/Nov/2018, I was to extract all the news that are ever posted on that new website.
              – Ahmed
              Nov 11 at 8:53










            • Yes, as I said, this extracts only part of the news. To get more, we need the URL of each section, such as https://www.cnbc.com/us-news/ or https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed
              – kcorlidy
              Nov 11 at 9:05










            • Yes kcorlidy, you are right. That was the basic problem. I did this by iterating through the root link, but it only gave news from at most 2 days before the current time. By the way, thanks. Kindly tell me more if you find anything.
              – Ahmed
              Nov 12 at 10:40










            • @Ahmed you can extract the section URLs from the menu, then pass each URL to my index() and extract each page of news.
              – kcorlidy
              Nov 12 at 11:28
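
The suggestion in the last comment, running the same extraction over several section URLs, could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: `collect_sections` and `fake_fetch` are hypothetical names, and `fetch_headlines` stands for any function that returns a list of (link, title) pairs, for example a variant of `index()` that returns `news_` instead of printing it. The sample links are made up so the sketch runs without network access.

```python
def collect_sections(section_urls, fetch_headlines):
    """Call fetch_headlines(url) for each section and merge the results.

    fetch_headlines is any callable that returns a list of (link, title) pairs.
    """
    seen = set()
    merged = []
    for section_url in section_urls:
        for link, title in fetch_headlines(section_url):
            # the same story can appear in several sections; keep the first copy
            if link not in seen:
                seen.add(link)
                merged.append((link, title))
    return merged


def fake_fetch(section_url):
    """Stand-in for a real fetcher; returns made-up data for two sections."""
    sample = {
        "https://www.cnbc.com/us-news/": [
            ("https://www.cnbc.com/a.html", "A"),
            ("https://www.cnbc.com/b.html", "B"),
        ],
        "https://www.cnbc.com/pre-markets/": [
            ("https://www.cnbc.com/b.html", "B"),
            ("https://www.cnbc.com/c.html", "C"),
        ],
    }
    return sample[section_url]


sections = ["https://www.cnbc.com/us-news/", "https://www.cnbc.com/pre-markets/"]
print(collect_sections(sections, fake_fetch))
```

Deduplicating by link keeps the dataset free of repeated rows when one article is listed in more than one section. As noted in the comments, this still covers only what each section page currently lists.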



































