Can't create an XPath capable of meeting a certain condition
I've created a script which can extract the links ending with the .html extension available under the class tableFile on a webpage. The script does its job. However, my intention at this point is to get only those .html links which have EX- in their Type field. I'm looking for a pure XPath solution (i.e. without using .getparent() or anything similar).
Link to that site: https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm
The script I've tried so far:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
        print(item)
When I try to get the links meeting the above condition with the approach below, I get an error:
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
        print(item)
Error I get:
if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'
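The error happens because selecting @href in XPath returns attribute values, which lxml hands back as plain strings (lxml.etree._ElementUnicodeResult), not elements, so they have no .xpath() method. Below is a minimal sketch of the same loop that selects the <a> elements instead; the expression used to reach the Type cell is an assumption about the table layout, not part of the original script:

import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

# Select the <a> elements rather than their @href strings, so .xpath() is still available
for link in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a'):
    href = link.get("href")
    # Assumed layout: the Type column is one of the <td> siblings following the link's cell
    cell_texts = link.xpath("./ancestor::td[1]/following-sibling::td/text()")
    if ".htm" in href and any(t.strip().startswith("EX") for t in cell_texts):
        print(href)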
This is what the files look like (screenshot of the document table):
python python-3.x xpath web-scraping
asked Nov 17 '18 at 7:11 by robots.txt
3 Answers
If you need a pure XPath solution, you can use the one below:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    print(item)
Output:

/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered Nov 17 '18 at 7:39 by Andersson
Apologies for the delayed response, @Andersson. Thanks for your effective solution. – robots.txt, Nov 17 '18 at 10:41
Is there any way to do the same using .cssselect(), @Andersson? I hope you will take a look in your spare time. – robots.txt, Nov 17 '18 at 13:17
@robots.txt, you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')], but CSS selectors are not so flexible IMHO, so it's not the same as the provided XPath. – Andersson, Nov 17 '18 at 13:36
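The hrefs returned by the XPath in this answer are site-relative, as the output above shows. A minimal optional sketch for resolving them to absolute URLs with urllib.parse.urljoin from the standard library, reusing root from the snippet above (this is a follow-up, not part of the answer itself):

from urllib.parse import urljoin

page = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm"
for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    # urljoin resolves the site-relative path against the page URL
    print(urljoin(page, item))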
It looks like you want:
//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]
There are a lot of different ways to do this with XPath. CSS is probably much simpler.
answered Nov 17 '18 at 10:25 by pguardiario
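A minimal sketch of how this expression could be plugged into the lxml setup from the question, reusing root from the question's snippet (the XPath itself is the answer's, unchanged):

# Select the <td> cells whose following sibling <td> starts with "EX", then their .htm links
for a in root.xpath('//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]'):
    print(a.get("href"))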
Here is a way using pandas dataframes:
import pandas as pd
tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
print(results)
answered Nov 17 '18 at 7:47, edited Nov 17 '18 at 11:38, by QHarr
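For reference, row[1][2] and row[1][3] pick the table's third and fourth columns by position (the Document and Type columns on this page, assuming that layout). A roughly equivalent sketch using boolean masks instead of iterrows:

import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm"
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

df = pd.read_html(url)[0]
doc = df.iloc[:, 2].astype(str)   # Document column (by position)
typ = df.iloc[:, 3].astype(str)   # Type column (by position)
mask = (doc.str.endswith(".htm") | doc.str.endswith(".txt")) & typ.str.startswith("EX")
results = (base + doc[mask]).tolist()
print(results)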