How to extract a movie-title from BoxOfficeMojo with Xpath












2














So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.



Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.



This is what I have to work with:




  • A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)


  • An ImportXML-statement in Google Sheets
    (https://support.google.com/docs/answer/3093342?hl=en)


  • An Xpath-query (and this is where it gets tricky)


I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:



//*[//table[@border = '0']]/td/font/b


Which gives me this result:



The A-Team

4


I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).



For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:



//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b


The above query doesn't work, and just throws an error, as if the query returns nothing...



Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).



However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.



The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.



Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.



Thanks in advance!










share|improve this question




















  • 1




    Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
    – shellter
    Nov 12 at 15:06












  • Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
    – TheRecruit
    Nov 12 at 15:11










  • Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
    – shellter
    Nov 12 at 15:14






  • 2




    Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
    – Andersson
    Nov 12 at 15:23








  • 1




    I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
    – Andersson
    Nov 12 at 15:57


















2














So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.



Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.



This is what I have to work with:




  • A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)


  • An ImportXML-statement in Google Sheets
    (https://support.google.com/docs/answer/3093342?hl=en)


  • An Xpath-query (and this is where it gets tricky)


I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:



//*[//table[@border = '0']]/td/font/b


Which gives me this result:



The A-Team

4


I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).



For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:



//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b


The above query doesn't work, and just throws an error, as if the query returns nothing...



Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).



However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.



The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.



Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.



Thanks in advance!










share|improve this question




















  • 1




    Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
    – shellter
    Nov 12 at 15:06












  • Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
    – TheRecruit
    Nov 12 at 15:11










  • Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
    – shellter
    Nov 12 at 15:14






  • 2




    Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
    – Andersson
    Nov 12 at 15:23








  • 1




    I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
    – Andersson
    Nov 12 at 15:57
















2












2








2







So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.



Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.



This is what I have to work with:




  • A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)


  • An ImportXML-statement in Google Sheets
    (https://support.google.com/docs/answer/3093342?hl=en)


  • An Xpath-query (and this is where it gets tricky)


I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:



//*[//table[@border = '0']]/td/font/b


Which gives me this result:



The A-Team

4


I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).



For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:



//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b


The above query doesn't work, and just throws an error, as if the query returns nothing...



Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).



However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.



The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.



Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.



Thanks in advance!










share|improve this question















So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.



Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.



This is what I have to work with:




  • A link to a given movie (eg. https://www.boxofficemojo.com/movies/?page=main&id=ateam.htm)


  • An ImportXML-statement in Google Sheets
    (https://support.google.com/docs/answer/3093342?hl=en)


  • An Xpath-query (and this is where it gets tricky)


I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:



//*[//table[@border = '0']]/td/font/b


Which gives me this result:



The A-Team

4


I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).



For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:



//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b


The above query doesn't work, and just throws an error, as if the query returns nothing...



Possible Solution
I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).



However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.



The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.



Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.



Thanks in advance!







xml xpath web-scraping google-sheets






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 12 at 15:12

























asked Nov 12 at 14:48









TheRecruit

578




578








  • 1




    Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
    – shellter
    Nov 12 at 15:06












  • Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
    – TheRecruit
    Nov 12 at 15:11










  • Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
    – shellter
    Nov 12 at 15:14






  • 2




    Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
    – Andersson
    Nov 12 at 15:23








  • 1




    I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
    – Andersson
    Nov 12 at 15:57
















  • 1




    Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
    – shellter
    Nov 12 at 15:06












  • Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
    – TheRecruit
    Nov 12 at 15:11










  • Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
    – shellter
    Nov 12 at 15:14






  • 2




    Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
    – Andersson
    Nov 12 at 15:23








  • 1




    I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
    – Andersson
    Nov 12 at 15:57










1




1




Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06






Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck!
– shellter
Nov 12 at 15:06














Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11




Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way.
– TheRecruit
Nov 12 at 15:11












Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14




Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck.
– shellter
Nov 12 at 15:14




2




2




Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23






Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b
– Andersson
Nov 12 at 15:23






1




1




I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57






I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b...
– Andersson
Nov 12 at 15:57














2 Answers
2






active

oldest

votes


















2














I tried this xpaths for a few movies and worked fine



//font[@face="Verdana" and @size="6"]/b





share|improve this answer





















  • Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
    – TheRecruit
    Nov 12 at 16:08












  • Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
    – Sagun Shrestha
    Nov 13 at 6:05



















2














As Andersson kindly suggested, using this query:



//font[@size="6"]/b


Did the trick :)



EDIT:



It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.



This solution however, seems to be working for all movies, and will return a single cell with the titles only:



=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53264614%2fhow-to-extract-a-movie-title-from-boxofficemojo-with-xpath%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2














    I tried this xpaths for a few movies and worked fine



    //font[@face="Verdana" and @size="6"]/b





    share|improve this answer





















    • Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
      – TheRecruit
      Nov 12 at 16:08












    • Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
      – Sagun Shrestha
      Nov 13 at 6:05
















    2














    I tried this xpaths for a few movies and worked fine



    //font[@face="Verdana" and @size="6"]/b





    share|improve this answer





















    • Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
      – TheRecruit
      Nov 12 at 16:08












    • Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
      – Sagun Shrestha
      Nov 13 at 6:05














    2












    2








    2






    I tried this xpaths for a few movies and worked fine



    //font[@face="Verdana" and @size="6"]/b





    share|improve this answer












    I tried this xpaths for a few movies and worked fine



    //font[@face="Verdana" and @size="6"]/b






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 12 at 15:38









    Sagun Shrestha

    688413




    688413












    • Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
      – TheRecruit
      Nov 12 at 16:08












    • Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
      – Sagun Shrestha
      Nov 13 at 6:05


















    • Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
      – TheRecruit
      Nov 12 at 16:08












    • Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
      – Sagun Shrestha
      Nov 13 at 6:05
















    Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
    – TheRecruit
    Nov 12 at 16:08






    Yes, it works for alot of movies. But try this one for instance, and you'll get an error (OR you will get the import spanning multiple cells): boxofficemojo.com/movies/?id=animatedspider-man.htm
    – TheRecruit
    Nov 12 at 16:08














    Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
    – Sagun Shrestha
    Nov 13 at 6:05




    Oh yeah because the font size is 5... I noticed that the movie title is always in the second td so this one should work //table//td[position()=2]/font[@face="Verdana"]/b
    – Sagun Shrestha
    Nov 13 at 6:05













    2














    As Andersson kindly suggested, using this query:



    //font[@size="6"]/b


    Did the trick :)



    EDIT:



    It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.



    This solution however, seems to be working for all movies, and will return a single cell with the titles only:



    =JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))





    share|improve this answer




























      2














      As Andersson kindly suggested, using this query:



      //font[@size="6"]/b


      Did the trick :)



      EDIT:



      It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.



      This solution however, seems to be working for all movies, and will return a single cell with the titles only:



      =JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))





      share|improve this answer


























        2












        2








        2






        As Andersson kindly suggested, using this query:



        //font[@size="6"]/b


        Did the trick :)



        EDIT:



        It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.



        This solution however, seems to be working for all movies, and will return a single cell with the titles only:



        =JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))





        share|improve this answer














        As Andersson kindly suggested, using this query:



        //font[@size="6"]/b


        Did the trick :)



        EDIT:



        It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.



        This solution however, seems to be working for all movies, and will return a single cell with the titles only:



        =JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 12 at 16:25

























        answered Nov 12 at 15:37









        TheRecruit

        578




        578






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53264614%2fhow-to-extract-a-movie-title-from-boxofficemojo-with-xpath%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Bressuire

            Vorschmack

            Quarantine