Can't create an XPath capable of meeting a certain condition
I've created a script which can extract the links ending with the .html extension available under the class tableFile on a webpage. The script does its job. However, my intention at this point is to get only those .html links which have EX- in their Type field. I'm looking for a pure XPath solution (i.e. without using .getparent() or anything similar).
Link to that site: https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm
The script I've tried so far:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
        print(item)
When I try to get the links meeting the above condition with the approach below, I get an error:
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
        print(item)
Error I get:
if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'
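The error happens because selecting @href in XPath returns attribute values, which lxml hands back as plain strings (lxml.etree._ElementUnicodeResult), not elements, so they have no .xpath() method. Below is a minimal sketch of the same loop that selects the <a> elements instead; the expression used to reach the Type cell is an assumption about the table layout, not part of the original script:

import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

# Select the <a> elements rather than their @href strings, so .xpath() is still available
for link in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a'):
    href = link.get("href")
    # Assumed layout: the Type column is one of the <td> siblings following the link's cell
    cell_texts = link.xpath("./ancestor::td[1]/following-sibling::td/text()")
    if ".htm" in href and any(t.strip().startswith("EX") for t in cell_texts):
        print(href)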
This is what the files look like (screenshot of the document table):
python python-3.x xpath web-scraping
asked Nov 17 '18 at 7:11 by robots.txt
3 Answers
If you need a pure XPath solution, you can use the one below:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    print(item)
Output:

/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered Nov 17 '18 at 7:39 by Andersson
Apologies for the delayed response, @Andersson. Thanks for your effective solution. – robots.txt, Nov 17 '18 at 10:41
Is there any way to do the same using .cssselect(), @Andersson? I hope you will take a look in your spare time. – robots.txt, Nov 17 '18 at 13:17
@robots.txt, you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')], but CSS selectors are not so flexible IMHO, so it's not the same as the provided XPath. – Andersson, Nov 17 '18 at 13:36
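The hrefs returned by the XPath in this answer are site-relative, as the output above shows. A minimal optional sketch for resolving them to absolute URLs with urllib.parse.urljoin from the standard library, reusing root from the snippet above (this is a follow-up, not part of the answer itself):

from urllib.parse import urljoin

page = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm"
for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    # urljoin resolves the site-relative path against the page URL
    print(urljoin(page, item))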
It looks like you want:
//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]
There are a lot of different ways to do this with XPath. CSS is probably much simpler.
answered Nov 17 '18 at 10:25 by pguardiario
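A minimal sketch of how this expression could be plugged into the lxml setup from the question, reusing root from the question's snippet (the XPath itself is the answer's, unchanged):

# Select the <td> cells whose following sibling <td> starts with "EX", then their .htm links
for a in root.xpath('//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]'):
    print(a.get("href"))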
Here is a way using pandas dataframes:
import pandas as pd
tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
print(results)
answered Nov 17 '18 at 7:47, edited Nov 17 '18 at 11:38, by QHarr
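For reference, row[1][2] and row[1][3] pick the table's third and fourth columns by position (the Document and Type columns on this page, assuming that layout). A roughly equivalent sketch using boolean masks instead of iterrows:

import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm"
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

df = pd.read_html(url)[0]
doc = df.iloc[:, 2].astype(str)   # Document column (by position)
typ = df.iloc[:, 3].astype(str)   # Type column (by position)
mask = (doc.str.endswith(".htm") | doc.str.endswith(".txt")) & typ.str.startswith("EX")
results = (base + doc[mask]).tolist()
print(results)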