Python scraping 'things to do' from TripAdvisor
From this page, I want to scrape the list 'Types of Things to Do in Miami' (you can find it near the end of the page). Here's what I have so far:
import requests
from bs4 import BeautifulSoup
# Define header to prevent errors
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"
headers = {'User-Agent': user_agent}
new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
# Get response from url
response = requests.get(new_url, headers = headers)
# Encode response for parsing
html = response.text.encode('utf-8')
# Soupify response
soup = BeautifulSoup(html, "lxml")
tag_elements = soup.findAll("a", {"class":"attractions-attraction-overview-main-Pill__pill--23S2Q"})
# Iterate over tag_elements and extract strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)
The problem is, I get values like 'Good for Couples (201)', 'Good for Big Groups (130)', 'Good for Kids (100)', which are from the 'Commonly Searched For in Miami' area of the page, below the "Types of Things..." part. I also don't get some of the values that I need, like "Traveler Resources (7)", "Day Trips (7)", etc. The class names for both of these lists ("Things to do..." and "Commonly searched...") are the same, and I'm matching on class in soup.findAll(), which I guess might be the cause of the problem. What is the correct way to do this? Is there some other approach that I should take?
python web-scraping beautifulsoup tripadvisor
asked Nov 23 '18 at 20:58 by Vishesh Shrivastav
4 Answers
Getting only the contents under the Types of Things to Do in Miami header is a little bit tricky. To do so, you need to define the selectors in an organized manner, as I did below. The following script should click on the See all button under the aforesaid header. Once the click is initiated, the script will parse the relevant content you are looking for:
from selenium import webdriver
from selenium.webdriver.support import ui
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
wait = ui.WebDriverWait(driver, 10)
driver.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")

# Wait for the "See all" caret in the first filter block, then click it via JavaScript
show_more = wait.until(lambda driver: driver.find_element_by_css_selector("[class='ui_container'] div:nth-of-type(1) .caret-down"))
driver.execute_script("arguments[0].click();", show_more)

# Parse the expanded list of attraction-type links
soup = BeautifulSoup(driver.page_source, "lxml")
items = [item.text for item in soup.select("[class='ui_container'] div:nth-of-type(1) a[href^='/Attractions-']")]
print(items)
driver.quit()
The output it produces:
['Tours (277)', 'Outdoor Activities (255)', 'Boat Tours & Water Sports (184)', 'Shopping (126)', 'Nightlife (126)', 'Spas & Wellness (109)', 'Fun & Games (67)', 'Transportation (66)', 'Museums (61)', 'Sights & Landmarks (54)', 'Nature & Parks (54)', 'Food & Drink (27)', 'Concerts & Shows (25)', 'Classes & Workshops (22)', 'Zoos & Aquariums (7)', 'Traveler Resources (7)', 'Day Trips (7)', 'Water & Amusement Parks (5)', 'Casinos & Gambling (3)', 'Events (2)']
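A note for current readers: the find_element_by_* helpers used in this answer and the ones below were removed in Selenium 4. On recent versions, the equivalent lookup goes through By; the same wait written against the Selenium 4 API looks like this:

from selenium.webdriver.common.by import By

# Same wait as above, using the Selenium 4 locator API
show_more = wait.until(lambda drv: drv.find_element(
    By.CSS_SELECTOR, "[class='ui_container'] div:nth-of-type(1) .caret-down"))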
answered Nov 24 '18 at 5:30 by SIM (edited Nov 24 '18 at 5:35)
This is pretty straightforward to do in the browser:
filters = driver.execute_script("return [...document.querySelectorAll('.filterName a')].map(a => a.innerText)")
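For context, here is a minimal sketch of how that one-liner slots into a Selenium session; the .filterName a selector comes from this answer and may need updating if TripAdvisor changes its markup:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")

# Run querySelectorAll in the page and return the link texts as a Python list
filters = driver.execute_script(
    "return [...document.querySelectorAll('.filterName a')].map(a => a.innerText)")
print(filters)
driver.quit()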
answered Nov 23 '18 at 23:52 by pguardiario
Looks like you'll need to use selenium. The problem is the dropdown doesn't show the remaining options until after you click it.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
driver = webdriver.Chrome(options=options)
driver.get('https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html')

# Wait for the "See all" toggle, scroll to it, then click it via JavaScript
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="component_3"]/div/div/div[12]/div[1]/div/div/div/div[1]/span')))
driver.execute_script("arguments[0].scrollIntoView();", driver.find_element_by_xpath('//*[@id="component_3"]/div/div/div[12]/div[1]/div/div/div/div[1]/span'))
driver.execute_script("arguments[0].click();", driver.find_element_by_xpath('//*[@id="component_3"]/div/div/div[12]/div[1]/div/div/div/div[1]/span'))

# Parse the expanded page
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
items = soup.findAll('a', {'class': 'attractions-attraction-overview-main-Pill__pill--23S2Q'})

# You could use this to get not just the text but also the ['href'] too.
for item in items:
    print(item.get_text())
driver.quit()
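As the comment in the script hints, each pill is an anchor tag, so you can collect the link target alongside the label. A small sketch reusing the items list from above:

# Map each pill label to its link target (a relative URL)
tags = {item.get_text(strip=True): item.get('href') for item in items}
print(tags)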
answered Nov 23 '18 at 22:48 by Kamikaze_goldfish
I think you need to click the show more to see all the available options, so use something like selenium. This includes waits to ensure all elements are present and the dropdown is clickable.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")

# Wait for the filter list to be visible, then expand it by clicking the caret
WebDriverWait(d, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".filter_list_0 div a")))
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#taplc_attraction_filters_clarity_0 span.ui_icon.caret-down"))).click()

# Collect the now-visible filter links
tag_elements = WebDriverWait(d, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".filter_list_0 div a")))
tags_list = [i.text for i in tag_elements]
print(tags_list)
d.quit()
Without selenium, I only get 15 items:
import requests
from bs4 import BeautifulSoup
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"
headers = {'User-Agent': user_agent}
new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
response = requests.get(new_url, headers = headers)
soup = BeautifulSoup(response.content, "lxml")
tag_elements = soup.select('#component_3 > div > div > div:nth-of-type(12) > div:nth-of-type(1) > div > div a')
tags_list = [i.text for i in tag_elements]
print(tags_list)
answered Nov 23 '18 at 21:59 by QHarr (edited Nov 23 '18 at 22:52)

The line WebDriverWait(d,5).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".filter_list_0 div a"))) results in a TimeoutException: Message: with no message displayed. I changed the time to 10 and 20 but it results in the same.
– Vishesh Shrivastav, Nov 24 '18 at 2:36

Odd. What happens if you comment out that line and increase the wait on the next line to 10? You can always execute_script on the dropdown otherwise to move things along.
– QHarr, Nov 24 '18 at 5:52
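For reference, QHarr's suggestion amounts to something like the following (a sketch against the same selectors as the answer, not re-tested on the live page):

# Skip the initial visibility wait and give the clickability wait 10 seconds
toggle = WebDriverWait(d, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "#taplc_attraction_filters_clarity_0 span.ui_icon.caret-down")))
# If a native click is intercepted, a JavaScript click usually still works
d.execute_script("arguments[0].click();", toggle)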