How to scrape HTML rendered by JavaScript

I need to write an automated scraper that can handle websites that are rendered by JavaScript (like YouTube) or that simply use some JavaScript somewhere in their HTML to generate content (such as the copyright year). Downloading their HTML source therefore makes no sense, as it won't be the final markup that users see.

I use Python with Selenium and WebDriver, so that I can execute JavaScript on a given website. My code for that purpose is:

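import os
from selenium import webdriver

def execute_javascript_on_website(self, js_command):
    # geckodriver is expected in ./executables next to this file.
    driver_path = os.path.dirname(os.path.abspath(__file__)) + '/executables/geckodriver'
    # firefox_options / executable_path are the Selenium 3 keyword names.
    driver = webdriver.Firefox(firefox_options=self.webdriver_options, executable_path=driver_path)
    try:
        driver.get(self.url)
        # Run the script in the loaded page and hand back its return value.
        return driver.execute_script(js_command)
    except Exception:
        # Errors are swallowed on purpose; the caller then receives None.
        return None
    finally:
        # Always end the browser session, even when the script fails.
        driver.quit()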

Where js_command = "return document.documentElement.outerHTML;".

With this code I get the source code, but not the rendered one. I can set js_command = "return document;" (as I would in the console), but then I get a <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="5a784804-f623-3041-9840-03f13ce83f53", element="585b43a1-f3b2-1e4a-b348-4ddaf2944550")> object that holds the HTML, and I can't find a way to get the HTML out of it.

Does anyone know how to get the HTML rendered by JavaScript (ideally as a string) using Selenium? Or some other technique that would do it?

PS: I also tried a WebDriver wait, but it didn't help; I still got HTML with unrendered JavaScript.

PPS: I need the whole HTML (the entire html tag) with the JavaScript already rendered in it (as shown, for example, in the browser's inspector). Or at least the DOM of the website in which the JavaScript has already been rendered.

edited Nov 12 at 18:14

asked Nov 11 at 13:39

Michal

I suspect you want to scrape with a scraper, not scrap with a scrapper.
– Terry Jan Reedy
Nov 11 at 13:48

Oh yes, a typo thanks.
– Michal
Nov 11 at 13:50

access the element using selenium methods
– Corey Goldberg
Nov 11 at 15:18

python selenium-webdriver web-scraping

2 Answers

driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

answered Nov 11 at 13:51

Rumpelstiltskin Koriat

No (don't know if I'm doing something wrong), this still returns unrendered HTML. I mean that when I execute the code with this JS line I still get the string that has inside <div>Copyright © 2000-<script>document.write(new Date().getFullYear())</script></div> instead of just the computed year.
– Michal
Nov 11 at 13:58

can you share the url you're trying to scrape?
– Rumpelstiltskin Koriat
Nov 11 at 14:09

It’s either macrumors.com or youtube.com. Ideally I would need it to work for both addresses.
– Michal
Nov 11 at 17:17

While this code may answer the question, it is better to explain how to solve the problem and provide the code as an example or reference. Code-only answers can be confusing and lack context.
– Robert Columbia
Nov 12 at 3:36

That script will be there because it's part of the DOM. If the script added something to the DOM, that thing will be there too.
– pguardiario
Nov 12 at 5:08

accepted

I've looked into it and have to admit that the JavaScript in @Rumpelstiltskin Koriat's answer works. The current year is present in the returned HTML string; it's placed right after the script tag (which, as @pguardiario mentioned, has to be there, since it's just an HTML tag). I've also found that in this case of simple JavaScript code in script tags, WebDriverWait is not even needed to obtain the HTML string with the rendered JavaScript output. Apparently I somehow managed to overlook the JavaScript-rendered string I was so eagerly looking for.
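Since the script tag stays in the serialized DOM next to the text it wrote, the returned string may need a little post-processing if only the visible text is wanted. Here is a minimal sketch using only the standard library; the names ScriptStripper and visible_text are hypothetical, and the sample string (with 2018 standing in for whatever year the browser computed) is an assumption for illustration, not output captured from a real site.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Collect text content while skipping everything inside <script> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script source arrives here as data too, so filter it out.
        if not self._in_script:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string with script bodies removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample of what the serialized DOM might look like.
rendered = ("<div>Copyright © 2000-"
            "<script>document.write(new Date().getFullYear())</script>"
            "2018</div>")
print(visible_text(rendered))
```

The script source is dropped while the text that document.write appended after it is kept.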

What I've also found (as @Corey Goldberg suggested) is that Selenium's own methods work well too, and read better than a raw JavaScript line: driver.find_element_by_tag_name('html').get_attribute('innerHTML'). This returns a string, not a WebElement.
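Either way, the idea is to ask the browser for its current DOM as a string. A rough sketch of a small helper follows, assuming driver is any Selenium WebDriver instance that has already navigated to the page; the function name get_rendered_html and the timeout and poll parameters are hypothetical. It polls document.readyState by hand rather than through Selenium's wait classes, so it only relies on execute_script.

```python
import time


def get_rendered_html(driver, timeout=10.0, poll=0.25):
    """Return the serialized DOM, waiting briefly for the page to finish loading.

    `driver` is assumed to expose execute_script(script, *args), as Selenium
    WebDriver objects do.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # readyState becomes "complete" once the document and its resources have loaded.
        if driver.execute_script("return document.readyState;") == "complete":
            break
        time.sleep(poll)
    # Serialize whatever the DOM looks like now, including script-generated nodes.
    return driver.execute_script("return document.documentElement.outerHTML;")
```

With a real browser, this would be called right after driver.get(url). Content that a page loads later (for example through background requests) may still be missing at that point, so a wait for a specific element can be needed on top of this. As far as I know, the driver.page_source property is another way to obtain the browser's current serialization.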

On the other hand, when you need to scrape the whole HTML of an Angular-powered website, it's necessary (at least in the case of YouTube) to locate its tag with id="content", or some tag inside it, and then put that locator at the beginning of every XPath used later in the code, as if that tag were the whole HTML. WebDriverWait was not needed here either.
But when locating just the html tag, the yt-app tag, or any other tag outside the one with id="content", HTML with unrendered JavaScript is returned. The HTML of Angular-generated websites is mixed with Angular's own tags (which browsers apparently ignore).
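The scoping described above can be sketched as two small helpers. This is only an illustration under stated assumptions: scoped_xpath and inner_html_by_id are hypothetical names, driver is assumed to be a Selenium WebDriver instance, and the 'video-title' id in the usage note is made up for the example.

```python
def scoped_xpath(root_id, xpath):
    """Prefix an XPath so it only matches inside the element with id=root_id.

    Assumes root_id contains no double quotes.
    """
    if not xpath.startswith("/"):
        raise ValueError("expected an XPath starting with '/' or '//'")
    return '//*[@id="{}"]{}'.format(root_id, xpath)


def inner_html_by_id(driver, element_id):
    """Return the innerHTML of the element with the given id, or None if absent."""
    script = ("var el = document.getElementById(arguments[0]);"
              "return el ? el.innerHTML : null;")
    # Extra positional arguments are exposed to the script as arguments[0], ...
    return driver.execute_script(script, element_id)
```

For example, scoped_xpath("content", "//a[@id='video-title']") yields //*[@id="content"]//a[@id='video-title'], and inner_html_by_id(driver, "content") returns just that subtree as a string.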

answered Nov 24 at 12:49

Michal
