Need to create a dataset on news using Python

I need to create a dataset of news articles. I need to extract every article that has ever been posted on a given news website. I have written this code:

import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime

# Scraped articles, one tuple per article.
records = []

def cnbc(base_url):
    # Download and parse a single article page.
    r = requests.get(base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    # Trim page boilerplate before and after the article text (offsets are specific to this page layout).
    content = content[18:-458]
    Country = 'United States'
    website = 'https://www.cnbc.com/'
    comments = ''
    genre = 'Political'
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    # Convert a date such as "2 Nov 2018" to "02-11-2018".
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    records.append((Title, content, Country, website, comments, genre, date))

cnbc("https://www.cnbc.com/2018/11/02/here-are-the-three-things-pulling-down-the-stock-market-again.html")

But this only lets me extract a single article.

Can anyone tell me how I can extract all the news URLs starting from the root of the website?

asked Nov 10 at 9:19

Ahmed

Of course, you need to fetch the page at https://www.cnbc.com/ and parse it to get all the latest news.
– ewwink
Nov 10 at 12:45

python beautifulsoup dataset


2 Answers
This is a Python 3 script. It is not flawless, but I hope it can serve as a starting point for what you are trying to achieve. I am not sure whether the site you are trying to scrape allows this, so I have left the constants WEB_SITE_BASE_URL and WEB_SITE_REGION_URL empty rather than putting its address there. What you put there is your choice.
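On the question of whether a site allows scraping, one thing that can be checked is its robots.txt. Below is a minimal sketch using the standard library's urllib.robotparser. The helper name allowed_by_robots, the sample rules and the example.com URLs are only illustrative assumptions, and robots.txt does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules; a real site's robots.txt will differ.
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://www.example.com/2018/11/02/a.html"))
    print(allowed_by_robots(rules, "https://www.example.com/private/x.html"))
```

In a real run you would first download the text from the site's /robots.txt and pass it in. The script itself follows.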

import requests
from bs4 import BeautifulSoup
from datetime import datetime

# e.g. "https://www.xxxx.com"
WEB_SITE_BASE_URL = ""
# e.g. "https://www.xxxx.com/?region=us"
WEB_SITE_REGION_URL = ""

def get_categories(web_site_base_url):
    # Read the category names from the navigation menu.
    r = requests.get(web_site_base_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    spans = soup.find_all(class_="nav-menu-buttonText")
    categories = [category.text for category in spans]
    return categories

def get_links(category_url):
    # Collect the unique article links found on a category page.
    r = requests.get(category_url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    # Keep only links to articles published in November 2018.
    filtered_links = list(set([k for k in links if '/2018/11/' in k]))
    return filtered_links

def news(link):
    # Download one article and return its fields as a tuple.
    r = requests.get(link)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    Title = soup.find("h1", {"class": "title"}).text.replace("\r", "").replace("\n", "")
    content = ' '
    for content_tag in soup.find_all("p"):
        content = content + content_tag.text.replace("\r", "").replace("\n", "")
    # Trim page boilerplate before and after the article text (offsets are layout-specific).
    content = content[18:-458]
    Country = 'United States'
    website = WEB_SITE_BASE_URL
    comments = ''
    date = soup.find("time", {"class": "datestamp"}).text[35:-2].replace("\r", "").replace("\n", "")
    d = datetime.strptime(date, "%d %b %Y")
    date = d.strftime("%d-%m-%Y")
    # Use the last breadcrumb of the page header as the genre.
    spans = soup.find_all(class_="header_title last breadcrumb")
    categories = [category.text for category in spans]
    genre = categories
    return (Title, content, Country, website, comments, genre, date)

categories = get_categories(WEB_SITE_REGION_URL)
list_of_link_lists = []
for category in categories:
    # Encode spaces in the category name as %20 for the URL.
    list_of_link_lists.append(get_links(WEB_SITE_BASE_URL + "/" + category.replace(" ", "%20")))
# Flatten the per-category lists and drop duplicate links.
flat_link_list = list(set([item for sublist in list_of_link_lists for item in sublist]))
articles_list = []
for link in flat_link_list:
    try:
        articles_list.append(news(WEB_SITE_BASE_URL + link))
    except Exception as error:
        # Skip pages whose layout does not match the selectors above.
        print("Something was wrong:", error)

print(articles_list)
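The filter in get_links above is hard-coded to November 2018. Here is a rough sketch of a more general filter, assuming article paths contain a /YYYY/MM/DD/ segment as in the URL from the question. The name filter_article_links and the example.com addresses are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Assumption: article paths contain /YYYY/MM/DD/, as in the question's example URL.
ARTICLE_PATH = re.compile(r"/\d{4}/\d{2}/\d{2}/")


def filter_article_links(base_url, hrefs):
    """Return sorted, de-duplicated absolute URLs that look like dated articles."""
    found = set()
    for href in hrefs:
        if href and ARTICLE_PATH.search(href):
            # urljoin handles both relative ("/2018/...") and absolute hrefs.
            found.add(urljoin(base_url, href))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "/2018/11/02/a.html",
        "/world/",
        "https://www.example.com/2017/03/15/b.html",
        "/2018/11/02/a.html",
    ]
    print(filter_article_links("https://www.example.com", sample))
```

Because it returns absolute URLs, the result could be passed to news() directly instead of prepending WEB_SITE_BASE_URL.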

answered Nov 10 at 20:57

jaskowitchious

Here is a rough way to extract part of the news, as shown in my code. First, find every div whose class is headline: news_headline = soup.find_all("div", class_="headline"). Then check whether each element is what we want.

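from urllib.parse import urljoin

new = []
for div in news_headline:
    # Skip headline blocks that contain no link.
    if div.a:
        # Resolve the href against the page URL to get an absolute link.
        link = urljoin(url, div.a.get("href"))
        if div.a.text:
            # split() and join() collapse tabs, newlines and repeated spaces
            title = " ".join(div.a.text.split())
        else:
            # Fall back to the link's title attribute when it has no text.
            title = " ".join(div.a.get("title").split())
        # Tuples are immutable, so build the pair in one step.
        each = (link, title)
        new.append(each)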

Here is the full code, written as compactly as I could.

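import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def index(url="https://www.cnbc.com/world/"):
    with requests.Session() as se:
        res = se.get(url)
        # Decode the response body as UTF-8.
        res.encoding = "UTF-8"
        text = res.text
    # The "lxml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(text, "lxml")
    news_headline = soup.find_all("div", class_="headline")
    # Build (absolute URL, headline) pairs; fall back to the title attribute when the link has no text.
    news_ = [(urljoin(url, div.a.get("href")), " ".join(div.a.text.split()) if div.a.text else " ".join(div.a.get("title").split())) for div in news_headline if div.a]
    print(news_)

index()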
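index() above only prints the (URL, headline) pairs. Since the goal is a dataset, here is a small sketch of saving such pairs as CSV with the standard csv module. The function name write_headlines, the column names and the sample row are assumptions for illustration only.

```python
import csv
import io


def write_headlines(rows, handle):
    """Write (url, title) pairs to an open text handle as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["url", "title"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample row standing in for the pairs built in index().
    rows = [("https://www.example.com/2018/11/02/a.html", "Stocks fall, again")]
    buffer = io.StringIO()
    write_headlines(rows, buffer)
    print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("headlines.csv", "w", newline="", encoding="utf-8") so that the csv module controls the line endings.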

answered Nov 11 at 4:36

kcorlidy


@kcorlidy, it is a good technique, but it only returns results from 6 Nov 2018 to 9 Nov 2018. I want to extract all the news that has ever been posted on that news website.
– Ahmed
Nov 11 at 8:53

Yes, as I said, this extracts only part of the news. To get more we need a URL, or several URLs, such as https://www.cnbc.com/us-news/ and https://www.cnbc.com/pre-markets/. But I am not sure how many sections the site has. @Ahmed
– kcorlidy
Nov 11 at 9:05

Yes, kcorlidy, you are right. That was the basic problem. I did this by iterating through the root link, but it only gave news from at most 2 days before the current time. By the way, thanks. Kindly tell me more if you find anything.
– Ahmed
Nov 12 at 10:40

@Ahmed you can extract the section URLs from the menu, then pass each URL into my index() and extract the news from each page.
– kcorlidy
Nov 12 at 11:28
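The suggestion above (take the section URLs from the menu and run each one through index()) could be sketched roughly like this. It assumes index() is changed to return its list of (URL, headline) pairs instead of printing it. collect_headlines and the fake page data are hypothetical names used only for illustration.

```python
def collect_headlines(section_urls, index_fn):
    """Call index_fn on every section URL and merge the results without duplicates.

    index_fn is assumed to return a list of (url, title) pairs for one page.
    """
    seen = set()
    results = []
    for section in section_urls:
        for url, title in index_fn(section):
            # The same article is often listed in several sections.
            if url not in seen:
                seen.add(url)
                results.append((url, title))
    return results


if __name__ == "__main__":
    # Stand-in for real pages so the sketch runs without network access.
    fake_pages = {
        "https://www.example.com/us-news/": [("https://www.example.com/a", "A")],
        "https://www.example.com/pre-markets/": [
            ("https://www.example.com/a", "A"),
            ("https://www.example.com/b", "B"),
        ],
    }
    print(collect_headlines(list(fake_pages), lambda url: fake_pages[url]))
```

This still only reaches articles currently linked from the section pages, so it does not by itself solve the "everything ever posted" part of the question.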
