Getting the final HTML, with JavaScript rendered, as a String in Java












9














I want to scrape data from an HTML page, but the page loads its reviews with JavaScript. A plain URL fetch in Java returns only the original HTML, without the JavaScript executed. I want the final page, after the JavaScript has run.

Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp

This page shows its comments through a Facebook plugin, which fetches them with JavaScript.

The same applies to this page:
http://www.imdb.com/title/tt0848228/reviews



What should I do?
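
To illustrate the problem described in the question, here is a minimal sketch using only the JDK (Java 11 or later). The page content, the class name PlainFetchDemo and the throwaway local server are all made up for the illustration; they stand in for a real site. The point is that a plain HTTP fetch returns the markup exactly as the server sent it, with the script still unexecuted.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class PlainFetchDemo {
    // Hypothetical page: the review text only reaches the <div> once the script runs.
    static final String PAGE =
        "<html><body><div id=\"reviews\"></div>"
        + "<script>document.getElementById('reviews').textContent = 'Great movie';</script>"
        + "</body></html>";

    /** Fetches a URL the plain way: the body comes back exactly as the server sent it. */
    static String fetch(String url) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Serves PAGE from a throwaway local server and returns what a plain fetch sees. */
    static String fetchDemoPage() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
            server.createContext("/", exchange -> {
                byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                return fetch("http://localhost:" + server.getAddress().getPort() + "/");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = fetchDemoPage();
        // The <div> is still empty because nothing has executed the <script>.
        System.out.println(html.contains("<div id=\"reviews\"></div>"));
    }
}
```

Getting the filled-in page requires something that actually executes the script, which is what the answers below address.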






























  • 1




    Your only real option for doing things like that in general is to harness a web browser as a component for your own software. Have the browser fetch the page and simulate whatever interactions are necessary for the JavaScript to do what it does, then examine the DOM.
    – Pointy
    Jun 3 '12 at 17:27










  • There should also be a way to use the Facebook API to fetch the comments for that post, together with the rest of the page contents.
    – Fabrício Matté
    Jun 3 '12 at 17:30


















java javascript web-scraping














edited Jun 3 '12 at 17:25 by Pointy

asked Jun 3 '12 at 17:21 by KillerTheLord






















2 Answers


















7














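Use PhantomJS: http://phantomjs.org



var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp", function (status) {
    // Wait 10 seconds so the page's scripts (such as the Facebook plugin) can run
    setTimeout(function () {
        // Where you want to save the screenshot
        page.render("screenshot.png");
        // page.evaluate can only return serializable values, so return the
        // comments as an HTML string; this assumes the page itself loads jQuery as $
        var fbcomments = page.evaluate(function () {
            return $(".fb-comments iframe").contents().find(".postContainer").html();
        });
        console.log(fbcomments);
        // The final HTML of the whole page is available as page.content
        phantom.exit();
    }, 10000);
});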


You have to start PhantomJS with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe).
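
Since the question asks for Java, one possible bridge is to start PhantomJS as an external process from Java and read what it prints. The sketch below assumes that a phantomjs executable is on the PATH and that the script passed to it prints the rendered HTML to standard output; the class name PhantomLauncher and both method names are made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class PhantomLauncher {
    /** Builds the command line; options such as --web-security=no go before the script. */
    static List<String> command(String scriptPath, String url) {
        // Assumption: "phantomjs" can be found on the PATH.
        return List.of("phantomjs", "--web-security=no", scriptPath, url);
    }

    /** Runs the command and returns everything it wrote to standard output as one String. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep error messages out of the HTML
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("PhantomJS exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: PhantomLauncher <script.js> <url>");
            return;
        }
        System.out.println(run(command(args[0], args[1])));
    }
}
```

The script would have to read the URL from its command-line arguments and print the page content itself; that part is not shown here.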



To communicate with other applications from PhantomJS, you can run a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
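
On the Java side, the POST approach could look roughly like the following sketch, which uses only the JDK's built-in HTTP server (Java 11 or later). The class name RenderedHtmlReceiver, the /rendered path and the post helper are assumptions made for this illustration; post merely stands in for the headless browser so that the example can run by itself.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

class RenderedHtmlReceiver {
    private final HttpServer server;
    private final CompletableFuture<String> html = new CompletableFuture<>();

    /** Starts listening on localhost; port 0 picks a free port. */
    RenderedHtmlReceiver(int port) {
        try {
            server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/rendered", exchange -> {
            // The request body is whatever the headless browser posted.
            byte[] body = exchange.getRequestBody().readAllBytes();
            html.complete(new String(body, StandardCharsets.UTF_8));
            exchange.sendResponseHeaders(204, -1); // 204 No Content, no response body
            exchange.close();
        });
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    /** Blocks until a page has been posted, then returns it as a String. */
    String awaitHtml() {
        return html.join();
    }

    void stop() {
        server.stop(0);
    }

    /** Demo only: stands in for the PhantomJS side by posting a body to the receiver. */
    static int post(String url, String body) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build()
                    .send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RenderedHtmlReceiver receiver = new RenderedHtmlReceiver(0);
        try {
            post("http://localhost:" + receiver.port() + "/rendered",
                 "<html><body>rendered</body></html>");
            System.out.println(receiver.awaitHtml());
        } finally {
            receiver.stop();
        }
    }
}
```

In real use the PhantomJS script would post the page content to this URL, and the Java program would call awaitHtml() to obtain the rendered page as a String.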



























  • oh man, that is awesome.
    – goat
    Jun 3 '12 at 17:47










  • @Ivan I want to do this in Java not javascript :P. The scraping has to be done in Java
    – KillerTheLord
    Jun 3 '12 at 17:55






  • 4




    It's a good thing that you don't want to do it with a potato; man... that would be hard!
    – Ivan Castellanos
    Jun 3 '12 at 23:05












  • @IvanCastellanos While I agree this should work, I don't get the rendered HTML on some specific sites. In the example, the site renders elements like 'SITE_BACKGROUND' inside another element, but PhantomJS never sees them. See the gist at gist.github.com/bizmate/db23887a7c5b066afafe2cc05acdd4ff. Any idea why this times out instead of returning the rendered HTML?
    – Bizmate
    Jun 30 '16 at 0:25



















4














You can use HtmlUnit, a Java-based "GUI-less browser". It loads a page the way a web browser does, executing its JavaScript, so you can easily get the final rendered output of any page. You can disable this behaviour, though.



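UPDATE: You asked for an example. You don't have to do anything extra to get the rendered page:



WebClient webClient = new WebClient(); // from the HtmlUnit library
HtmlPage myPage = webClient.getPage(myUrl); // loads the page and runs its scripts
webClient.waitForBackgroundJavaScript(10000); // optional: allow up to 10 s (an arbitrary value) for asynchronous scripts
String finalHtml = myPage.asXml(); // the DOM after script execution, as a String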


UPDATE 2: You can get the page inside an iframe as follows:



HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();


Please read the HtmlUnit documentation. When it comes to getting page content, there is very little you can't do in HtmlUnit.



























  • But this will be problematic for pages that reference missing resources: for example, if the page includes a JS file that returns a 404 at that location, this API will throw exceptions.
    – Freak
    Nov 27 '13 at 12:32






  • 1




    Unfortunately, the library you suggested is extremely slow (~40 s to render a page that renders in 1 s in a normal browser!)
    – Konrad G
    Jul 20 '16 at 8:36





















First answer (PhantomJS): answered Jun 3 '12 at 17:31 by Ivan Castellanos






















Second answer (HtmlUnit): answered Jun 4 '12 at 6:38 by user517491


















