How to extract a movie-title from BoxOfficeMojo with Xpath

So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.

Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.

This is what I have to work with:

A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)

An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)

An Xpath-query (and this is where it gets tricky)

I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:

//*[//table[@border = '0']]/td/font/b

Which gives me this result:

The A-Team



4

I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).

For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:

//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b

The above query doesn't work, and just throws an error, as if the query returns nothing...

Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).

However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.

The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.

Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.

Thanks in advance!

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

1

Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06

Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11

Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14

2

Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23

1

I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57

|
show 2 more comments

This is what I have to work with:

A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)

An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)

An Xpath-query (and this is where it gets tricky)

//*[//table[@border = '0']]/td/font/b

Which gives me this result:

The A-Team



4

I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).

For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:

//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b

The above query doesn't work, and just throws an error, as if the query returns nothing...

However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.

Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.

Thanks in advance!

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

1

Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06

Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11

Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14

2

Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23

1

I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57

|
show 2 more comments

This is what I have to work with:

A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)

An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)

An Xpath-query (and this is where it gets tricky)

//*[//table[@border = '0']]/td/font/b

Which gives me this result:

The A-Team



4

I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).

For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:

//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b

The above query doesn't work, and just throws an error, as if the query returns nothing...

However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.

Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.

Thanks in advance!

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

This is what I have to work with:

A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)

An ImportXML-statement in Google Sheets
(https://support.google.com/docs/answer/3093342?hl=en)

An Xpath-query (and this is where it gets tricky)

//*[//table[@border = '0']]/td/font/b

Which gives me this result:

The A-Team



4

I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).

For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:

//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b

The above query doesn't work, and just throws an error, as if the query returns nothing...

However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.

Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.

Thanks in advance!

xml xpath web-scraping google-sheets

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

edited Nov 12 at 15:12

asked Nov 12 at 14:48

TheRecruit

578

asked Nov 12 at 14:48

TheRecruit

578

asked Nov 12 at 14:48

TheRecruit

578

1

Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06

Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11

Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14

2

Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23

1

I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57

|
show 2 more comments

1

Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06

Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11

Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14

2

Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23

1

I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57

Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06

Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11

Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14

Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23

I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57

|
show 2 more comments

2 Answers
2

active

oldest

votes

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b

answered Nov 12 at 15:38

Sagun Shrestha

688413

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

add a comment |

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53264614%2fhow-to-extract-a-movie-title-from-boxofficemojo-with-xpath%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b

answered Nov 12 at 15:38

Sagun Shrestha

688413

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

add a comment |

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b

answered Nov 12 at 15:38

Sagun Shrestha

688413

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

add a comment |

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b

answered Nov 12 at 15:38

Sagun Shrestha

688413

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b

answered Nov 12 at 15:38

Sagun Shrestha

688413

answered Nov 12 at 15:38

Sagun Shrestha

688413

answered Nov 12 at 15:38

Sagun Shrestha

688413

answered Nov 12 at 15:38

Sagun Shrestha

688413

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

add a comment |

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
– TheRecruit
Nov 12 at 16:08

Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
– Sagun Shrestha
Nov 13 at 6:05

add a comment |

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

add a comment |

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

add a comment |

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

edited Nov 12 at 16:25

answered Nov 12 at 15:37

TheRecruit

578

answered Nov 12 at 15:37

TheRecruit

578

answered Nov 12 at 15:37

TheRecruit

578

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky