How to extract a movie-title from BoxOfficeMojo with Xpath
So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.
Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.
This is what I have to work with:
A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)
An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)- An Xpath-query (and this is where it gets tricky)
I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:
//*[//table[@border = '0']]/td/font/b
Which gives me this result:
The A-Team
4
I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).
For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:
//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b
The above query doesn't work, and just throws an error, as if the query returns nothing...
Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).
However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.
The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.
Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.
Thanks in advance!
xml xpath web-scraping google-sheets
|
show 2 more comments
So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.
Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.
This is what I have to work with:
A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)
An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)- An Xpath-query (and this is where it gets tricky)
I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:
//*[//table[@border = '0']]/td/font/b
Which gives me this result:
The A-Team
4
I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).
For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:
//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b
The above query doesn't work, and just throws an error, as if the query returns nothing...
Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).
However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.
The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.
Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.
Thanks in advance!
xml xpath web-scraping google-sheets
1
Can't help with the fineries for XPATH, butxpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
2
Try(//table//font/b)[1]
or//font[@size="6"]/b
or//font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
1
I've checked one page only. If The font size can be, for instance5
or6
, you can try//font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as//font[@size > 5 and @face="Verdana"]/b
...
– Andersson
Nov 12 at 15:57
|
show 2 more comments
So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.
Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.
This is what I have to work with:
A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)
An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)- An Xpath-query (and this is where it gets tricky)
I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:
//*[//table[@border = '0']]/td/font/b
Which gives me this result:
The A-Team
4
I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).
For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:
//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b
The above query doesn't work, and just throws an error, as if the query returns nothing...
Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).
However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.
The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.
Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.
Thanks in advance!
xml xpath web-scraping google-sheets
So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.
Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.
This is what I have to work with:
A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)
An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)- An Xpath-query (and this is where it gets tricky)
I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:
//*[//table[@border = '0']]/td/font/b
Which gives me this result:
The A-Team
4
I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).
For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:
//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b
The above query doesn't work, and just throws an error, as if the query returns nothing...
Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).
However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.
The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.
Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.
Thanks in advance!
xml xpath web-scraping google-sheets
xml xpath web-scraping google-sheets
edited Nov 12 at 15:12
asked Nov 12 at 14:48
TheRecruit
578
578
1
Can't help with the fineries for XPATH, butxpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
2
Try(//table//font/b)[1]
or//font[@size="6"]/b
or//font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
1
I've checked one page only. If The font size can be, for instance5
or6
, you can try//font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as//font[@size > 5 and @face="Verdana"]/b
...
– Andersson
Nov 12 at 15:57
|
show 2 more comments
1
Can't help with the fineries for XPATH, butxpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
2
Try(//table//font/b)[1]
or//font[@size="6"]/b
or//font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
1
I've checked one page only. If The font size can be, for instance5
or6
, you can try//font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as//font[@size > 5 and @face="Verdana"]/b
...
– Andersson
Nov 12 at 15:57
1
1
Can't help with the fineries for XPATH, but
xpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!– shellter
Nov 12 at 15:06
Can't help with the fineries for XPATH, but
xpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!– shellter
Nov 12 at 15:06
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
2
2
Try
(//table//font/b)[1]
or //font[@size="6"]/b
or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
Try
(//table//font/b)[1]
or //font[@size="6"]/b
or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
1
1
I've checked one page only. If The font size can be, for instance
5
or 6
, you can try //font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b
...– Andersson
Nov 12 at 15:57
I've checked one page only. If The font size can be, for instance
5
or 6
, you can try //font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b
...– Andersson
Nov 12 at 15:57
|
show 2 more comments
2 Answers
2
active
oldest
votes
I tried this xpaths for a few movies and worked fine
//font[@face="Verdana" and @size="6"]/b
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the secondtd
so this one should work//table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
add a comment |
As Andersson kindly suggested, using this query:
//font[@size="6"]/b
Did the trick :)
EDIT:
It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.
This solution however, seems to be working for all movies, and will return a single cell with the titles only:
=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53264614%2fhow-to-extract-a-movie-title-from-boxofficemojo-with-xpath%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
I tried this xpaths for a few movies and worked fine
//font[@face="Verdana" and @size="6"]/b
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the secondtd
so this one should work//table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
add a comment |
I tried this xpaths for a few movies and worked fine
//font[@face="Verdana" and @size="6"]/b
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the secondtd
so this one should work//table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
add a comment |
I tried this xpaths for a few movies and worked fine
//font[@face="Verdana" and @size="6"]/b
I tried this xpaths for a few movies and worked fine
//font[@face="Verdana" and @size="6"]/b
answered Nov 12 at 15:38
Sagun Shrestha
688413
688413
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the secondtd
so this one should work//table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
add a comment |
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the secondtd
so this one should work//table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08
Oh yeah because the font size is 5... I noticed that the movie title is always in the second
td
so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
Oh yeah because the font size is 5... I noticed that the movie title is always in the second
td
so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05
add a comment |
As Andersson kindly suggested, using this query:
//font[@size="6"]/b
Did the trick :)
EDIT:
It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.
This solution however, seems to be working for all movies, and will return a single cell with the titles only:
=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))
add a comment |
As Andersson kindly suggested, using this query:
//font[@size="6"]/b
Did the trick :)
EDIT:
It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.
This solution however, seems to be working for all movies, and will return a single cell with the titles only:
=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))
add a comment |
As Andersson kindly suggested, using this query:
//font[@size="6"]/b
Did the trick :)
EDIT:
It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.
This solution however, seems to be working for all movies, and will return a single cell with the titles only:
=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))
As Andersson kindly suggested, using this query:
//font[@size="6"]/b
Did the trick :)
EDIT:
It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.
This solution however, seems to be working for all movies, and will return a single cell with the titles only:
=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))
edited Nov 12 at 16:25
answered Nov 12 at 15:37
TheRecruit
578
578
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53264614%2fhow-to-extract-a-movie-title-from-boxofficemojo-with-xpath%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Can't help with the fineries for XPATH, but
xpath_emitting_cmd ... args... | sed -n '2p'
will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!– shellter
Nov 12 at 15:06
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14
2
Try
(//table//font/b)[1]
or//font[@size="6"]/b
or//font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23
1
I've checked one page only. If The font size can be, for instance
5
or6
, you can try//font[@size > 5]/b
. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as//font[@size > 5 and @face="Verdana"]/b
...– Andersson
Nov 12 at 15:57