Only href="#", no onclick(), how do I load this in script?
I'm writing a scraper for the articles on https://www.welt.de, and I'd also like to include the comments. However, not all comments are loaded when the page loads. Instead, one has to click a link to load more comments, repeatedly, until at some point all of them are loaded.
For example: https://www.welt.de/finanzen/immobilien/article183878020/Bundesbank-sieht-im-Immobilienboom-ein-Stabilitaetsrisiko.html
When you scroll down, a button labeled "MEHR KOMMENTARE ANZEIGEN" (German for "show more comments") appears.
The markup for this link looks like this:
<div href="#" style="text-align: center; height: 44px; cursor: pointer;">
  <a style="font-size: 0.6875rem; font-family: ffmark, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-weight: 800; color: rgb(0, 57, 91); line-height: 5;">
    <span style="font-size: 0.6875rem; font-family: ffmark, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-weight: 500; margin-right: 0.625rem; text-align: right; color: rgb(120, 120, 120);">
      MEHR KOMMENTARE ANZEIGEN
      <span style="width: 14px; height: 8px; margin: 0px 0px 0px 0.625rem; padding-top: 0px; display: inline-block; vertical-align: initial;">
        <svg viewBox="0 0 15 9" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
          <g stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
            <g transform="translate(-608.000000, -4318.000000)" fill="#787878">
              <polygon transform="translate(615.205882, 4322.852941) rotate(-90.000000) translate(-615.205882, -4322.852941) " points="618.264706 4315.79412 611.205882 4322.85353 618.264706 4329.91176 619.205882 4328.97059 613.088824 4322.85353 619.205882 4316.73529">
              </polygon>
            </g>
          </g>
        </svg>
      </span>
    </span>
  </a>
</div>
However, I do not know how to trigger this link from a script.
I understand that href="#" is used when a link is handled by JavaScript, and that it is considered bad style, since it is only used to change the appearance of the mouse cursor, for which there are other methods.
But where is the onclick() handler? I'm kind of dumbfounded here...
javascript html web-scraping web-crawler href
edited Nov 15 '18 at 15:32 by Khauri McClain
asked Nov 15 '18 at 14:38 by Thomas Kaltfuss
If there's no onclick then I'd guess that a click handler is registered somewhere in the JavaScript the page loads. Any idea what JavaScript frameworks (if any) the page uses?
– phuzi
Nov 15 '18 at 14:44
There are about 20 different script files that those pages load. All the event handlers will be in there somewhere. But as elken has shown below, if you are able to extract all the relevant API endpoints, using those will be far better than actually scraping the site. Be mindful of copyright, though; I'm not sure whether they would mind.
– Shilly
Nov 15 '18 at 14:47
When it comes to web scraping, I'd personally recommend a headless browser such as headless Chrome, because you can do things like programmatically click elements without having to sniff for event listeners. You can also wait for the DOM to change or for a network request to be made before proceeding, all of which sounds like it would benefit your use case. You can't do that with a content script, which is what I assume you're using?
– Khauri McClain
Nov 15 '18 at 14:53
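The headless-browser idea from the last comment could look roughly like the sketch below. It is not taken from the thread: it assumes a Puppeteer-style `page` object whose `evaluate(fn, ...args)` runs a function inside the loaded page, and it locates the button using only the `div[href="#"]` markup and label text quoted in the question. The function name `clickUntilGone` and its options are hypothetical.

```javascript
// Sketch only: `page` is assumed to behave like a Puppeteer Page object.
// The selector and label come from the markup quoted in the question.
async function clickUntilGone(page, { maxClicks = 200, delayMs = 1000 } = {}) {
  let clicks = 0;
  while (clicks < maxClicks) {
    // This callback runs inside the browser page, not in Node.
    const clicked = await page.evaluate((label) => {
      const candidates = Array.from(document.querySelectorAll('div[href="#"]'));
      const button = candidates.find((el) => el.textContent.includes(label));
      if (!button) return false; // button gone: presumably everything is loaded
      button.click();
      return true;
    }, 'MEHR KOMMENTARE ANZEIGEN');

    if (!clicked) break;
    clicks += 1;
    // Crude fixed pause so the new comments can arrive before the next click.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return clicks;
}
```

With Puppeteer, `page` would typically come from `browser.newPage()` followed by `page.goto(articleUrl)`. A fixed delay is the simplest thing that could work; waiting for the matching network response or for a DOM change would be more robust.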
2 Answers
Clicking that "show more comments" link twice gives me the following URLs:
https://api-co.la.welt.de/api/comments?document-id=183878020&created-cursor=2018-11-15T13:52:41.714&sort=NEWEST
https://api-co.la.welt.de/api/comments?document-id=183878020&created-cursor=2018-11-15T12:23:26.896&sort=NEWEST
Both return comments. So just use the document ID you have and keep adjusting created-cursor until you have all the comments?
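A generic way to "keep adjusting the cursor" is a loop that requests a page, derives the next cursor from it, and stops when nothing new comes back. The sketch below is an illustration, not the site's documented behaviour: the response format is not shown in this thread, so both the page fetcher (`fetchPage`) and the rule for deriving the next cursor (`nextCursorOf`) are hypothetical callbacks supplied by the caller.

```javascript
// Generic cursor pagination. Nothing here is specific to welt.de:
// fetchPage(cursor) must resolve to an array of items (cursor is null first),
// nextCursorOf(items) must return the cursor for the next request, or null.
async function collectAllPages(fetchPage, nextCursorOf, maxPages = 1000) {
  const all = [];
  let cursor = null;
  for (let i = 0; i < maxPages; i += 1) {
    const items = await fetchPage(cursor);
    if (!Array.isArray(items) || items.length === 0) break; // no more data
    all.push(...items);
    const next = nextCursorOf(items);
    if (next == null || next === cursor) break; // cursor did not advance
    cursor = next;
  }
  return all;
}
```

For this site, one plausible guess is that the next `created-cursor` is the creation timestamp of the oldest comment received so far, since the observed URLs use `sort=NEWEST`. That is an assumption to verify against real responses.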
EDIT:
Removing the created-cursor parameter should give you all the comments:
https://api-co.la.welt.de/api/comments?document-id=183878020
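To build these URLs for other articles, a small helper can pull the numeric ID out of the article URL and assemble the query string. This is a sketch under one assumption drawn only from the single example in this thread: that the number after `article` in the page URL is the `document-id` the API expects. The function names are made up for illustration.

```javascript
const API_BASE = 'https://api-co.la.welt.de/api/comments';

// Assumption: the digits after "article" in the path are the document ID.
function extractDocumentId(articleUrl) {
  const match = /\/article(\d+)\//.exec(new URL(articleUrl).pathname);
  return match ? match[1] : null;
}

// Builds the comments URL, with or without a created-cursor value.
function buildCommentsUrl(documentId, createdCursor) {
  const url = new URL(API_BASE);
  url.searchParams.set('document-id', documentId);
  if (createdCursor) {
    // Mirrors the parameters seen in the observed requests.
    url.searchParams.set('created-cursor', createdCursor);
    url.searchParams.set('sort', 'NEWEST');
  }
  return url.toString();
}
```

Note that `URLSearchParams` percent-encodes the colons in the timestamp, which a server should decode to the same value as the unencoded form seen in the browser.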
EDIT 2:
As someone else mentioned, this might not be a good idea without first contacting the owner of the site.
edited Nov 15 '18 at 14:49
answered Nov 15 '18 at 14:42 by elken
As for finding the click handler:
If you inspect this element, you can see that it has a click event handler calling something in communityweb.js.
This is almost certainly attached with JavaScript somewhere else (e.g., document.getElementById('something').addEventListener("click", function(){ ... });).
If you want, you can follow it through and see the code it's calling (be sure to use the 'pretty print' feature, as it's minified).
It gets complicated from there, but if you're determined enough, you could step through in the debugger and see what's being called.
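To illustrate why no onclick shows up in the markup, here is a small stand-alone sketch. It uses the `EventTarget` and `Event` globals that Node 15+ provides and that DOM elements also implement. The `button` object merely stands in for the `<div href="#">` element; it is not the site's code.

```javascript
// Stand-in for the <div href="#"> element from the question.
const button = new EventTarget();

let loads = 0;
// Registered from script, the way a bundled file might do it.
button.addEventListener('click', () => {
  loads += 1;
});

// No inline handler exists, so there is nothing to find in the HTML ...
const hasInlineHandler = typeof button.onclick === 'function';

// ... yet dispatching a click still runs the registered listener.
button.dispatchEvent(new Event('click'));

console.log({ hasInlineHandler, loads });
```

In Chrome DevTools, the console utility `getEventListeners(element)` lists listeners that were attached this way.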
answered Nov 15 '18 at 14:51 by gregmac