Getting final HTML with JavaScript rendered, in Java, as a String
I want to fetch data from an HTML page (scrape it), but the page loads its reviews with JavaScript. With a normal Java URL fetch I only get the original HTML, without the JavaScript executed. I want the final page, after the JavaScript has run.
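For context, the plain fetch described above might look like the following sketch. It assumes Java 11+ (java.net.http); the names PlainFetch, fetchRawHtml and countScriptTags are made up for illustration. The body it returns is exactly what the server sent, so anything the page's scripts would have added later is missing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PlainFetch {

    /** Downloads the raw response body. No JavaScript is executed. */
    static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Counts "<script" openings: code the server sent but that nothing here runs. */
    static int countScriptTags(String html) {
        String lower = html.toLowerCase();
        int count = 0;
        for (int i = lower.indexOf("<script"); i >= 0; i = lower.indexOf("<script", i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PlainFetch <url>");
            return;
        }
        String html = fetchRawHtml(args[0]);
        System.out.println(html.length() + " characters of raw HTML, "
                + countScriptTags(html) + " script tags left unexecuted");
    }
}
```

A non-zero script count is only a hint that the page may build part of its content in the browser; it does not prove that the reviews themselves come from a script.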
Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp
This page shows its comments through a Facebook plugin, which fetches them with JavaScript.
The same applies to this page:
http://www.imdb.com/title/tt0848228/reviews
What should I do?
java javascript web-scraping
1
Your only real option for doing things like that in general is to harness a web browser as a component for your own software. Have the browser fetch the page and simulate whatever interactions are necessary for the JavaScript to do what it does, then examine the DOM.
– Pointy
Jun 3 '12 at 17:27
There should be a way to use the Facebook API to fetch the comments from that post as well, together with the rest of the page contents.
– Fabrício Matté
Jun 3 '12 at 17:30
edited Jun 3 '12 at 17:25
Pointy
asked Jun 3 '12 at 17:21
KillerTheLord
2 Answers
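Use PhantomJS: http://phantomjs.org
    var page = require('webpage').create();
    page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp", function (status) {
        if (status !== "success") {
            console.log("Failed to load the page");
            phantom.exit(1);
            return;
        }
        // Wait for the page's scripts (e.g. the Facebook plugin) to finish
        setTimeout(function () {
            // Where you want to save it
            page.render("screenshot.png");
            // You can access its content using jQuery (assuming the page loads it);
            // return a string, because page.evaluate cannot pass DOM/jQuery objects back
            var fbcomments = page.evaluate(function () {
                return $(".fb-comments iframe").contents().find(".postContainer").html();
            });
            console.log(fbcomments);
            phantom.exit();
        }, 10000);
    });
You have to run PhantomJS with the --web-security=no option to allow cross-domain interaction (i.e. for the Facebook iframe).
To communicate with other applications from PhantomJS you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js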
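Since the scraping itself has to happen in Java, another way to bridge the two is to launch PhantomJS as a child process and read what it prints. The following is only a sketch under assumptions: the phantomjs binary is on the PATH, and render.js is a hypothetical script that takes the URL as its first argument and writes the rendered HTML to standard output. PhantomRunner, buildCommand and run are illustrative names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PhantomRunner {

    /** Builds the command line: phantomjs [options] script [script args]. */
    static List<String> buildCommand(String scriptPath, String url) {
        return List.of("phantomjs", "--web-security=no", scriptPath, url);
    }

    /** Runs the command and returns everything it wrote to standard output. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Let error output go to our own stderr so it cannot mix into the HTML
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Read stdout completely before waiting, so the child never blocks on a full pipe
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PhantomRunner <url>");
            return;
        }
        // "render.js" is a placeholder for your own PhantomJS script
        String html = run(buildCommand("render.js", args[0]));
        System.out.println(html.length() + " characters of rendered HTML");
    }
}
```

The returned string can then be passed to whatever HTML parser the Java side already uses.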
answered Jun 3 '12 at 17:31
Ivan Castellanos
oh man, that is awesome.
– goat
Jun 3 '12 at 17:47
@Ivan I want to do this in Java, not JavaScript :P. The scraping has to be done in Java.
– KillerTheLord
Jun 3 '12 at 17:55
4
It's a good thing that you don't want to do it with a potato; man... that would be hard!
– Ivan Castellanos
Jun 3 '12 at 23:05
@IvanCastellanos While I agree this should work, I don't get the rendered HTML on some specific sites. In the example, the site renders elements like 'SITE_BACKGROUND' inside another element, but PhantomJS never sees it. See the gist at gist.github.com/bizmate/db23887a7c5b066afafe2cc05acdd4ff . Any idea why this times out instead of returning the rendered HTML?
– Bizmate
Jun 30 '16 at 0:25
You can use HtmlUnit, a Java-based "GUI-less browser". It loads the page the way a web browser does, executing its JavaScript, so you can get the final rendered output of the page. You can disable this behaviour, though.
UPDATE: You were asking for an example? You don't have to do anything extra for that:
Example:
    WebClient webClient = new WebClient();
    HtmlPage myPage = (HtmlPage) webClient.getPage(myUrl);
    String renderedHtml = myPage.asXml(); // current DOM serialized as a String
UPDATE 2: You can get an iframe as follows:
    HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
Please read the HtmlUnit documentation. There is very little you can't do in HtmlUnit when it comes to getting page content.
answered Jun 4 '12 at 6:38
user517491
But it will be problematic for URLs whose pages reference missing (404) resources: if the page includes a JS file that is not present at that location, this API will throw exceptions.
– Freak
Nov 27 '13 at 12:32
1
Unfortunately the library you suggested is just super mega sloooow (~40 s to render a page that renders in 1 s in a normal browser!)
– Konrad G
Jul 20 '16 at 8:36