Extracting html text using R - can't access some nodes

I have a large number of water take permits that are available online and I want to extract some data from them. For example:



url <- "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1"


I don't know HTML at all, but I have been plugging away with help from Google and a friend. I can get to some of the nodes without any issues using an XPath or CSS selector; for instance, to get the title:



library(rvest)
url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="main"]/div/h1') %>%
  html_text()
[1] "Details for CRC000002.1"


Or using CSS selectors:



url %>%
  read_html() %>%
  html_nodes(css = "#main") %>%
  html_nodes(css = "div") %>%
  html_nodes(css = "h1") %>%
  html_text()
[1] "Details for CRC000002.1"


So far, so good, but the information I actually want is buried a bit deeper and I can't seem to get to it. For instance, the client name field ("Killermont Station Limited", in this case) has this XPath:



clientxpath <- '//*[@id="main"]/div/div[1]/div/table/tbody/tr[1]/td[2]'
url %>%
  read_html() %>%
  html_nodes(xpath = clientxpath) %>%
  html_text()
character(0)


The CSS selectors get quite convoluted, but I get the same result. The help file for html_nodes() says:



# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc


But it doesn't give me any clues on what to use as an alternative prefix in the XPath (it might be obvious if I knew HTML).
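
For reference, a minimal sketch of the prefix difference the help file is hinting at: '//' always restarts from the document root, even inside a chained call, while './/' searches relative to the current node.

pg   <- read_html(url)
main <- html_nodes(pg, css = "#main")

main %>%
  html_nodes(xpath = ".//h1") %>%   # relative to #main; '//h1' would search the whole document
  html_text()
[1] "Details for CRC000002.1"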



My friend pointed out that some of the document is generated with JavaScript (AJAX), which may be part of the problem too. That said, the bit I'm trying to get to does show up in the browser's HTML, but it sits inside a node called 'div.ajax-block'.



CSS selector: #main > div > div.ajax-block > div > table > tbody > tr:nth-child(1) > td:nth-child(4)
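
A quick diagnostic sketch to confirm the suspicion: check whether the data is present at all in the HTML that R receives (it won't be if it is injected later by AJAX).

pg <- read_html(url)
grepl("Killermont", as.character(pg))                 # FALSE is expected: the client name is absent
length(html_nodes(pg, css = "div.ajax-block table"))  # 0 is expected: the table itself is absent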


Can anyone help? Thanks!

html r web-scraping rvest

asked Nov 24 '18 at 10:05 by TimM

  • First of all, is it legal for you to get data from that page? – NelsonGon, Nov 24 '18 at 10:14

  • Yes, it's all public information. – TimM, Nov 24 '18 at 10:17

  • It's a dynamic page; use Selenium. – ewwink, Nov 24 '18 at 10:22

  • How would you go about extracting the data in RSelenium? I had a quick look and it seems like it's pretty involved! – TimM, Nov 24 '18 at 11:16

  • Please see my answer. This "use selenium" craze is just crazy. – hrbrmstr, Nov 24 '18 at 17:38

1 Answer

It's super disconcerting that most if not all SO R contributors default to "use a heavyweight third-party dependency" in curt "answers" when it comes to scraping. 99% of the time you don't need Selenium. You just need to exercise the little gray cells.



First, a big clue that the page loads content asynchronously is the wait-spinner that appears. The second is in your snippet, where the div actually has ajax as part of its selector name. Both are tell-tale signs that XHR requests are in play.



If you open Developer Tools in your browser, reload the page, then go to Network and the XHR tab, you'll see:



[Screenshot: browser Developer Tools, Network panel, XHR tab, showing the page's asynchronous requests]



Most of the "real" data on the page is loaded dynamically. We can write httr calls that mimic the browser calls.



However



We first need to make one GET call to the main page to prime some cookies (which will be carried over for us) and then find a pre-generated session token that's used to prevent abuse of the site. It's defined in JavaScript, so we'll use the V8 package to evaluate it. We could have just used regular expressions to find the string. Do whatever you like.



library(httr)
library(rvest)
library(dplyr)
library(V8)

ctx <- v8() # we need this to eval some javascript

# Prime Cookies -----------------------------------------------------------

res <- httr::GET("https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1")

httr::cookies(res)
## domain flag path secure expiration name
## 1 .ecan.govt.nz TRUE / FALSE 2019-11-24 11:46:13 visid_incap_927063
## 2 .ecan.govt.nz TRUE / FALSE <NA> incap_ses_148_927063
## value
## 1 +p8XAM6uReGmEnVIdnaxoxWL+VsAAAAAQUIPAAAAAABjdOjQDbXt7PG3tpBpELha
## 2 nXJSYz8zbCRj8tGhzNANAhaL+VsAAAAA7JyOH7Gu4qeIb6KKk/iSYQ==
pg <- httr::content(res)

html_node(pg, xpath = ".//script[contains(., '_monsido')]") %>%
  html_text() %>%
  ctx$eval()
## [1] "2"
monsido_token <- ctx$get("_monsido")[1, 2]
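
For completeness, here is a sketch of the regular-expression route mentioned above. The pattern is only a guess at the shape of the inline assignment and would need adjusting against the actual script source:

script_txt <- html_text(html_node(pg, xpath = ".//script[contains(., '_monsido')]"))
m <- regmatches(script_txt, regexec("[\"']token[\"']\\s*,\\s*[\"']([^\"']+)[\"']", script_txt))
monsido_token <- m[[1]][2]  # first capture group; the pattern above is hypothetical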


Here's the searchlist (which is, indeed, empty):



httr::VERB(
  verb = "POST", url = "https://www.ecan.govt.nz/data/document-library/searchlist",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    `X-Requested-With` = "XMLHttpRequest",
    TE = "Trailers"
  ),
  httr::set_cookies(
    monsido = monsido_token
  ),
  body = list(
    name = "CRC000002.1",
    pageSize = "999999"
  ),
  encode = "form"
) -> res

httr::content(res)
## NULL ## <<=== this is OK as there is no response
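
Since the body is empty, a quick sanity check that the POST itself succeeded can use the standard httr helpers:

httr::status_code(res)  # expect 200
httr::http_error(res)   # expect FALSE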


Here's the "Consent Overview" section:



httr::GET(
  url = "https://www.ecan.govt.nz/data/consent-search/consentoverview/CRC000002.1",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    Authority = "www.ecan.govt.nz",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  httr::set_cookies(
    monsido = monsido_token
  )
) -> res

httr::content(res) %>%
  html_table() %>%
  glimpse()
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ X1: chr [1:5] "RMA Authorisation Number" "Consent Location" "To" "Commencement Date" ...
## ..$ X2: chr [1:5] "CRC000002.1" "Manuka Creek, KILLERMONT STATION" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X3: chr [1:5] "Client Name" "State" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X4: chr [1:5] "Killermont Station Limited" "Issued - Active" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
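
As a follow-on sketch (not part of the original answer): the table holds label/value pairs in column pairs X1/X2 and X3/X4, so the client name the question was after can be pulled out directly:

tbl  <- html_table(httr::content(res))[[1]]
vals <- setNames(c(tbl$X2, tbl$X4), c(tbl$X1, tbl$X3))  # field name -> value
vals[["Client Name"]]
## [1] "Killermont Station Limited"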


Here are the "Consent Conditions":



httr::GET(
  url = "https://www.ecan.govt.nz/data/consent-search/consentconditions/CRC000002.1",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    Authority = "www.ecan.govt.nz",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  httr::set_cookies(
    monsido = monsido_token
  )
) -> res

httr::content(res) %>%
  as.character() %>%
  substring(1, 300) %>%
  cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="consentDetails">
## <ul class="unstyled-list">
## <li>
##
##
## <strong class="pull-left">1</strong> <div class="pad-left1">The rate at which wa

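A small sketch for tidying that fragment into a character vector, using the class names visible in the snippet above (consentDetails, pad-left1):

conditions <- httr::content(res) %>%
  html_nodes(css = "div.consentDetails li div.pad-left1") %>%
  html_text(trim = TRUE)
head(conditions, 1)  # text of the first condition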

Here's the "Consent Related":



httr::GET(
  url = "https://www.ecan.govt.nz/data/consent-search/consentrelated/CRC000002.1",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    Authority = "www.ecan.govt.nz",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  httr::set_cookies(
    monsido = monsido_token
  )
) -> res

httr::content(res) %>%
  as.character() %>%
  substring(1, 300) %>%
  cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body>
## <p>There are no related documents.</p>
##
##
##
##
##
## <div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead><tr>
## <th>Relationship</th>
## <th>Recor


Here's the "Workflow:



httr::GET(
  url = "https://www.ecan.govt.nz/data/consent-search/consentworkflow/CRC000002.1",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    Authority = "www.ecan.govt.nz",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  httr::set_cookies(
    monsido = monsido_token
  )
) -> res

httr::content(res)
## {xml_document}
## <html>
## [1] <body><p>No workflow</p></body>


Here are the "Consent Flow Restrictions":



httr::GET(
  url = "https://www.ecan.govt.nz/data/consent-search/consentflowrestrictions/CRC000002.1",
  httr::add_headers(
    Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
    Authority = "www.ecan.govt.nz",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  httr::set_cookies(
    monsido = monsido_token
  )
) -> res

httr::content(res) %>%
  as.character() %>%
  substring(1, 300) %>%
  cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead>
## <th colspan="2">Low Flow Site</th>
## <th>Todays Flow <span class="lower">(m3/s)</span>
## </th>


You still need to parse HTML, but now you can do it all with just plain R packages.
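
Not part of the original answer, but since the five requests above differ only in the endpoint name, a small wrapper (the name ecan_section is made up for this sketch) keeps repeated runs tidy:

ecan_section <- function(section, consent = "CRC000002.1", token = monsido_token) {
  httr::GET(
    url = sprintf("https://www.ecan.govt.nz/data/consent-search/%s/%s", section, consent),
    httr::add_headers(
      Referer = sprintf("https://www.ecan.govt.nz/data/consent-search/consentdetails/%s", consent),
      Authority = "www.ecan.govt.nz",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    httr::set_cookies(monsido = token)
  )
}

# e.g.: httr::content(ecan_section("consentoverview"))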

answered Nov 24 '18 at 17:38, edited Nov 25 '18 at 12:30 by hrbrmstr

  • Thanks, that's brilliant! Works perfectly, and now I'm on to banging my head against a wall with the text pattern matching. Can you explain to me briefly how to select the arguments for GET? They work perfectly in this case, but I don't think I could replicate it, and the help file in R is a little opaque. – TimM, Nov 24 '18 at 21:54

  • If this isn't time-sensitive, lemme put this into GitHub by tomorrow and I'll drop a link here and we can iron it out in GitHub issues. – hrbrmstr, Nov 24 '18 at 22:11

  • That would be amazing, thanks! – TimM, Nov 24 '18 at 22:13

  • Cool. If you could put your initial comment to this answer as an issue in github.com/hrbrmstr/nz-ecan, I'll drop a note on the morrow. Tonight is the last night with my college freshman son home for the Thanksgiving break, so I'll be able to crank on this tomorrow with zeal. – hrbrmstr, Nov 24 '18 at 22:17