Extracting HTML text using R - can't access some nodes
I have a large number of water take permits that are available online, and I want to extract some data from them. For example:
url <- "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1"
I don't know HTML at all, but I have been plugging away with help from Google and a friend. I can get to some of the nodes without any issues using an XPath or CSS selector; for instance, to get the title:
library(rvest)
url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="main"]/div/h1') %>%
html_text()
[1] "Details for CRC000002.1"
Or using CSS selectors:
url %>%
read_html() %>%
html_nodes(css = "#main") %>%
html_nodes(css = "div") %>%
html_nodes(css = "h1") %>%
html_text()
[1] "Details for CRC000002.1"
So far, so good, but the information I actually want is buried a bit deeper and I can't seem to get to it. For instance, the client name field ("Killermont Station Limited", in this case) has this XPath:
clientxpath <- '//*[@id="main"]/div/div[1]/div/table/tbody/tr[1]/td[2]'
url %>%
read_html() %>%
html_nodes(xpath = clientxpath) %>%
html_text()
character(0)
The CSS selector gets quite convoluted, but I get the same result. The help file for html_nodes() says:
# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc
But it doesn't give me any clues on what to use as an alternative prefix in the XPath (it might be obvious if I knew HTML).
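From what I can gather, the alternative prefix the help file is hinting at is the relative './/' form; here's a quick sketch (my assumption, based on that hint) that returns the same title as before:
url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="main"]') %>%
  html_nodes(xpath = './/h1') %>%   # './/' searches relative to #main; '//' would search the whole document again
  html_text()
But a relative prefix doesn't get me to the client name either, so the problem seems to lie elsewhere.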
My friend pointed out that some of the document is generated by JavaScript (AJAX), which may be part of the problem too. That said, the bit I'm trying to get to does show up in the HTML, but it sits within a node called div.ajax-block:
CSS selector: #main > div > div.ajax-block > div > table > tbody > tr:nth-child(1) > td:nth-child(4)
Can anyone help? Thanks!
Tags: html, r, web-scraping, rvest
First of all, is it legal for you to get data from that page?
– NelsonGon
Nov 24 '18 at 10:14
Yes, it's all public information.
– TimM
Nov 24 '18 at 10:17
It's a dynamic page; use Selenium.
– ewwink
Nov 24 '18 at 10:22
How would you go about extracting the data in RSelenium? I had a quick look and it seems like it's pretty involved!
– TimM
Nov 24 '18 at 11:16
Please see my answer. This "use selenium" craze is just crazy.
– hrbrmstr
Nov 24 '18 at 17:38
1 Answer
It's super disconcerting that most, if not all, SO R contributors default to "use a heavyweight third-party dependency" in curt "answers" when it comes to scraping. 99% of the time you don't need Selenium. You just need to exercise the little gray cells.
First, a big clue that the page loads content asynchronously is the wait-spinner that appears. The second is in your snippet, where the div has ajax as part of its selector name. Those are tell-tale signs that XHR requests are in play.
If you open Developer Tools in your browser, reload the page, and go to the Network tab's XHR section, you'll see that most of the "real" data on the page is loaded dynamically. We can write httr calls that mimic the browser calls.
However…
We first need to make one GET call to the main page to prime some cookies (which will be carried over for us) and then find a pre-generated session token that's used to prevent abuse of the site. It's defined using JavaScript, so we'll use the V8 package to evaluate it. We could have just used regular expressions to find the string; do whatever you like.
library(httr)
library(rvest)
library(dplyr)
library(V8)
ctx <- v8() # we need this to eval some javascript
# Prime Cookies -----------------------------------------------------------
res <- httr::GET("https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1")
httr::cookies(res)
## domain flag path secure expiration name
## 1 .ecan.govt.nz TRUE / FALSE 2019-11-24 11:46:13 visid_incap_927063
## 2 .ecan.govt.nz TRUE / FALSE <NA> incap_ses_148_927063
## value
## 1 +p8XAM6uReGmEnVIdnaxoxWL+VsAAAAAQUIPAAAAAABjdOjQDbXt7PG3tpBpELha
## 2 nXJSYz8zbCRj8tGhzNANAhaL+VsAAAAA7JyOH7Gu4qeIb6KKk/iSYQ==
pg <- httr::content(res)
html_node(pg, xpath=".//script[contains(., '_monsido')]") %>%
html_text() %>%
ctx$eval()
## [1] "2"
monsido_token <- ctx$get("_monsido")[1,2]
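(If you'd rather skip V8, a regex might look something like this. It's only a sketch and the pattern is a guess; the [1,2] indexing above suggests the token value sits next to a "token" label, but the script's actual contents dictate the real pattern.)
# Hypothetical regex alternative to the V8 evaluation above
script_txt <- html_node(pg, xpath = ".//script[contains(., '_monsido')]") %>%
  html_text()
m <- regmatches(script_txt, regexpr('"token"\\s*,\\s*"[^"]+"', script_txt))
monsido_token_rx <- gsub('.*"([^"]+)"$', "\\1", m)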
Here's the searchlist call (which is, indeed, empty):
httr::VERB(
verb = "POST", url = "https://www.ecan.govt.nz/data/document-library/searchlist",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
`X-Requested-With` = "XMLHttpRequest",
TE = "Trailers"
), httr::set_cookies(
monsido = monsido_token
),
body = list(
name = "CRC000002.1",
pageSize = "999999"
),
encode = "form"
) -> res
httr::content(res)
## NULL ## <<=== this is OK as there is no response
Here's the "Consent Overview" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentoverview/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
html_table() %>%
glimpse()
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ X1: chr [1:5] "RMA Authorisation Number" "Consent Location" "To" "Commencement Date" ...
## ..$ X2: chr [1:5] "CRC000002.1" "Manuka Creek, KILLERMONT STATION" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X3: chr [1:5] "Client Name" "State" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X4: chr [1:5] "Killermont Station Limited" "Issued - Active" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
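(A quick follow-on sketch: the glimpse() output shows the overview table interleaves labels and values in column pairs X1/X2 and X3/X4, so stacking those pairs gives a tidy two-column lookup for fields like the client name.)
ov <- httr::content(res) %>% html_table() %>% .[[1]]
ov_tidy <- rbind(
  setNames(ov[, 1:2], c("field", "value")),   # first label/value pair
  setNames(ov[, 3:4], c("field", "value"))    # second label/value pair
)
ov_tidy$value[ov_tidy$field == "Client Name"]
## [1] "Killermont Station Limited"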
Here are the "Consent Conditions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentconditions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="consentDetails">
## <ul class="unstyled-list">
## <li>
##
##
## <strong class="pull-left">1</strong> <div class="pad-left1">The rate at which wa
Here's the "Consent Related":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentrelated/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body>
## <p>There are no related documents.</p>
##
##
##
##
##
## <div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead><tr>
## <th>Relationship</th>
## <th>Recor
Here's the "Workflow:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentworkflow/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res)
## {xml_document}
## <html>
## [1] <body><p>No workflow</p></body>
Here are the "Consent Flow Restrictions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentflowrestrictions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead>
## <th colspan="2">Low Flow Site</th>
## <th>Todays Flow <span class="lower">(m3/s)</span>
## </th>
You still need to parse HTML, but now you can do it all with plain R packages.
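Since the five section calls differ only in their endpoint, they could also be collapsed into a small helper. A minimal sketch (the function and parameter names are just illustrative):
# Wrap the repeated GET pattern from the calls above
get_consent_section <- function(section, consent, token) {
  httr::GET(
    url = sprintf("https://www.ecan.govt.nz/data/consent-search/%s/%s",
                  section, consent),
    httr::add_headers(
      Referer = sprintf("https://www.ecan.govt.nz/data/consent-search/consentdetails/%s",
                        consent),
      Authority = "www.ecan.govt.nz",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    httr::set_cookies(monsido = token)
  )
}
# e.g. the "Consent Overview" call becomes
res <- get_consent_section("consentoverview", "CRC000002.1", monsido_token)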
Thanks, that's brilliant! Works perfectly, and now I'm on to banging my head against a wall with the text pattern matching. Can you explain to me briefly how to select the arguments for GET? They work perfectly in this case, but I don't think I could replicate it, and the help file in R is a little opaque.
– TimM
Nov 24 '18 at 21:54
If this isn't time-sensitive, lemme put this into GitHub by tomorrow; I'll drop a link here and we can iron it out in GitHub issues.
– hrbrmstr
Nov 24 '18 at 22:11
That would be amazing, thanks!
– TimM
Nov 24 '18 at 22:13
Cool. If you could put your initial comment to this answer as an issue in github.com/hrbrmstr/nz-ecan, I'll drop a note on the morrow. Tonight is the last night with my college freshman son home for the Thanksgiving break, so I'll be able to crank on this tomorrow with zeal.
– hrbrmstr
Nov 24 '18 at 22:17