How to scrape HTML rendered by JavaScript
I need to write an automated scraper that can handle websites that are rendered by JavaScript (like YouTube) or that simply use some JavaScript in their HTML to generate content (such as a copyright year). Downloading the raw HTML source of such pages makes no sense, because it is not the final markup that users see.
I use Python with Selenium and WebDriver, so that I can execute JavaScript on a given website. My code for that purpose is:
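    import os
    from selenium import webdriver

    def execute_javascript_on_website(self, js_command):
        # geckodriver is expected in ./executables next to this file
        driver_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'executables', 'geckodriver')
        # firefox_options / executable_path are Selenium 3 keyword arguments
        driver = webdriver.Firefox(firefox_options=self.webdriver_options, executable_path=driver_path)
        try:
            driver.get(self.url)
            return driver.execute_script(js_command)
        except Exception:
            # Errors are swallowed, so the method returns None on failure
            return None
        finally:
            # quit() ends the session and stops the geckodriver process
            driver.quit()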
Where js_command = "return document.documentElement.outerHTML;".
With this code I get the source code, but not the rendered one. I can use js_command = "return document;" (as I would in the console), but then I get a <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="5a784804-f623-3041-9840-03f13ce83f53", element="585b43a1-f3b2-1e4a-b348-4ddaf2944550")> object that contains the HTML, and I don't see how to get the HTML out of it.
Does anyone know how to get the HTML rendered by JavaScript (ideally as a string) using Selenium? Or is there another technique that would do it?
PS: I also tried a WebDriver wait, but it didn't help; I still got HTML with unrendered JavaScript.
PPS: I need the whole HTML (the entire html tag) with the JavaScript already rendered in it, as it appears in the browser's inspector, or at least the DOM of the page after the JavaScript has run.
python selenium-webdriver web-scraping
edited Nov 12 at 18:14
asked Nov 11 at 13:39
Michal
I suspect you want to scrape with a scraper, not scrap with a scrapper.
– Terry Jan Reedy
Nov 11 at 13:48
Oh yes, a typo, thanks.
– Michal
Nov 11 at 13:50
access the element using selenium methods
– Corey Goldberg
Nov 11 at 15:18
2 Answers
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
answered Nov 11 at 13:51
Rumpelstiltskin Koriat
No (I don't know if I'm doing something wrong), this still returns unrendered HTML. When I execute the code with this JS line, I still get a string that contains <div>Copyright © 2000-<script>document.write(new Date().getFullYear())</script></div> instead of just the computed year.
– Michal
Nov 11 at 13:58
can you share the url you're trying to scrape?
– Rumpelstiltskin Koriat
Nov 11 at 14:09
It's either macrumors.com or youtube.com. Ideally I need it to work for both addresses.
– Michal
Nov 11 at 17:17
While this code may answer the question, it is better to explain how to solve the problem and provide the code as an example or reference. Code-only answers can be confusing and lack context.
– Robert Columbia
Nov 12 at 3:36
That script will be there because it's part of the DOM. If the script added something to the DOM, that thing will be there too.
– pguardiario
Nov 12 at 5:08
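To illustrate the point made in the last comment: the serialized DOM keeps the script element and also contains whatever that script wrote. The sketch below is only an illustration, not part of any answer here. The sample string is hypothetical (it assumes the year came out as 2018), `strip_script_elements` is a made-up helper name, and the regex is a rough stand-in for a proper HTML parser.

```python
import re

# Rough pattern for <script>...</script> blocks. A real HTML parser is more
# robust; this only illustrates why the script source shows up in the output.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_script_elements(html: str) -> str:
    """Return html with every <script> element (tags and source) removed."""
    return SCRIPT_BLOCK.sub("", html)


# Hypothetical DOM serialization: the script element is still present,
# followed by the text node that document.write() produced.
serialized = (
    "<div>Copyright © 2000-"
    "<script>document.write(new Date().getFullYear())</script>"
    "2018</div>"
)

cleaned = strip_script_elements(serialized)
# cleaned == "<div>Copyright © 2000-2018</div>"
```

If the scraper only needs the visible text, removing script elements after serialization like this may be enough; the rendered value was in the string all along.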
accepted
I've looked into it and have to admit that the JavaScript in @Rumpelstiltskin Koriat's answer works. The current year is present in the returned HTML string; it's placed right after the script tag (which, as @pguardiario mentioned, has to be there because it's just an HTML tag that is part of the DOM). I've also found that for simple JavaScript like this, run from script tags, WebDriverWait isn't even needed to obtain the HTML string with the rendered output. Apparently I somehow managed to overlook the JavaScript-rendered string I was so eagerly looking for.
What I've also found (as @Corey Goldberg suggested) is that Selenium's own methods work well too and look cleaner than a raw JavaScript line: driver.find_element_by_tag_name('html').get_attribute('innerHTML'). This returns a string rather than a WebElement.
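A sketch of that Selenium-only variant for newer Selenium releases, where the find_element_by_* helpers were removed in favour of find_element(by, value). The function names are made up for illustration, and `fetch_inner_html` assumes Firefox and a geckodriver that Selenium can locate on its own.

```python
def inner_html_of_root(driver) -> str:
    """Serialize the live DOM below <html> using Selenium's locator API."""
    # "tag name" is the string value of selenium.webdriver.common.by.By.TAG_NAME.
    root = driver.find_element("tag name", "html")
    return root.get_attribute("innerHTML")


def fetch_inner_html(url: str) -> str:
    """Open url in Firefox, return the rendered inner HTML, always quit."""
    # Imported lazily so the helper above can be used with any driver object.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes geckodriver can be found
    try:
        driver.get(url)
        return inner_html_of_root(driver)
    finally:
        driver.quit()
```

Calling `fetch_inner_html("https://www.macrumors.com")` would then return the markup as a plain string, script elements included.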
On the other hand, to scrape the whole HTML of an Angular-powered website, it's best (at least in the case of YouTube) to locate its tag with id="content", or some tag inside it, and then prefix every XPath used later in the code with that location, as if it were the whole HTML document. WebDriverWait wasn't needed here either.
But when I locate just the html tag, the yt-app tag, or any other tag outside the one with id="content", the returned HTML contains unrendered JavaScript. The HTML of Angular-generated websites is mixed with Angular's own tags (which browsers apparently ignore).
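For pages where the container does take a while to appear, a minimal sketch of scoping the scrape to the element with id="content". The helper name and the timeout and poll defaults are assumptions for illustration. Selenium's own WebDriverWait with expected_conditions.presence_of_element_located does the same job; the plain loop here only makes the idea explicit without extra imports.

```python
import time


def wait_for_element_html(driver, element_id="content", timeout=10.0, poll=0.5):
    """Poll until an element with the given id exists, then return its outerHTML."""
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the string value of By.ID; find_elements returns an empty
        # list (instead of raising) while nothing matches yet.
        matches = driver.find_elements("id", element_id)
        if matches:
            # Only the first match is used if the id occurs more than once.
            return matches[0].get_attribute("outerHTML")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element with id={element_id!r} after {timeout}s")
        time.sleep(poll)
```

The returned string can then be treated as the document for later parsing, which matches the idea of prefixing every XPath with the container's location.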
answered Nov 24 at 12:49
Michal