Is it possible to access https pages through a proxy with Scrapy?
I can successfully access http pages through a proxy in Scrapy, but I cannot access https sites. I've researched the topic, but it's still unclear to me. Is it possible to access https pages through a proxy with Scrapy? Do I need to patch anything? Or add some custom code? If it can be confirmed this is a standard feature, I can follow up with more details. Hopefully this is an easy one.
Edit:
Here is what I added to the settings file:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'test_website.middlewares.ProxyMiddleware': 100,
    }

    PROXIES = [{'ip_port': 'us-il.proxymesh.com:31280', 'user_pass': 'username:password'}]
Here is the code for my spider:
    import scrapy

    class TestSpider(scrapy.Spider):
        name = "test_spider"
        allowed_domains = ["ipify.org"]
        start_urls = ["https://api.ipify.org"]

        def parse(self, response):
            with open('test.html', 'wb') as f:
                f.write(response.body)
Here is the middlewares file:
    import base64
    import random

    from settings import PROXIES

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = "http://%s" % proxy['ip_port']
                encoded_user_pass = base64.encodestring(proxy['user_pass'])
                request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            else:
                request.meta['proxy'] = "http://%s" % proxy['ip_port']
Here is the log file:
2015-08-12 20:15:50 [scrapy] INFO: Scrapy 1.0.3 started (bot: test_website)
2015-08-12 20:15:50 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 20:15:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test_website.spiders', 'SPIDER_MODULES': ['test_website.spiders'], 'LOG_STDOUT': True, 'LOG_FILE': 'log.txt', 'BOT_NAME': 'test_website'}
2015-08-12 20:15:51 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-12 20:15:53 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 20:15:53 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 20:15:53 [scrapy] INFO: Enabled item pipelines:
2015-08-12 20:15:53 [scrapy] INFO: Spider opened
2015-08-12 20:15:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-12 20:15:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Gave up retrying <GET https://api.ipify.org> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] ERROR: Error downloading <GET https://api.ipify.org>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] INFO: Closing spider (finished)
2015-08-12 20:15:53 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
'downloader/request_bytes': 819,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 943000),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 38000)}
2015-08-12 20:15:53 [scrapy] INFO: Spider closed (finished)
My spider works successfully if the 's' in 'https' is removed or if I disable the proxy, and I can access the https page through the same proxy in my browser.
Tags: python, https, scrapy, proxies
asked Aug 12 '15 at 19:37 by patrick.s · edited Aug 13 '15 at 2:16
3 Answers
I got this error because I was using base64.encodestring instead of base64.b64encode in the proxy middleware. Try changing it.

answered Sep 14 '16 at 1:32 by Aminah Nuraini
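To illustrate why this matters (my own sketch, not part of the original answer): the old base64.encodestring appends a trailing newline and wraps long output every 76 characters, which corrupts the Proxy-Authorization header; b64encode returns the bare digest. In Python 3, encodestring was renamed encodebytes:

```python
import base64

creds = b"username:password"

# encodebytes (the Python 3 name for the old encodestring) appends '\n'
wrapped = base64.encodebytes(creds)
print(wrapped)  # b'dXNlcm5hbWU6cGFzc3dvcmQ=\n'

# b64encode returns the bare token, safe to embed in a header value
clean = base64.b64encode(creds)
print(clean)    # b'dXNlcm5hbWU6cGFzc3dvcmQ='
```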
I think it's possible. If you're setting the proxy through Request.meta, it should just work. If you're setting the proxy with the http_proxy environment variable, you might also need to set https_proxy.

It might be the case, however, that your proxy does not support HTTPS. It would be easier to help you if you posted the error you are getting.

answered Aug 12 '15 at 22:46 by Artur Gaspar · edited Aug 12 '15 at 23:54
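A minimal sketch of the environment-variable approach the answer mentions (the proxy URL here is hypothetical): both variables must be set before the crawler starts, since Scrapy's HttpProxyMiddleware reads them at startup.

```python
import os

# Hypothetical proxy URL; https_proxy covers https:// requests,
# http_proxy alone only covers http:// requests.
proxy_url = "http://username:password@us-il.proxymesh.com:31280"
os.environ["http_proxy"] = proxy_url
os.environ["https_proxy"] = proxy_url

# Scrapy's HttpProxyMiddleware picks these up when the spider starts.
```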
Scrapy handles the HTTPS/SSL part automatically. As @Aminah Nuraini said, just use base64.b64encode instead of base64.encodestring in the proxy middleware.

Add the following code in middlewares.py:

    import base64

    class ProxyMiddleware(object):
        # Overwrite process_request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] = "http://<PROXY_SERVER>:<PROXY_PORT>"
            # Use the following lines if your proxy requires authentication
            proxy_user_pass = "<PROXY_USERNAME>:<PROXY_PASSWORD>"
            # Set up basic authentication for the proxy
            encoded_user_pass = base64.b64encode(proxy_user_pass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

Add the proxy middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        '<YOUR_PROJECT>.middlewares.ProxyMiddleware': 100,
    }

answered Nov 13 '18 at 12:13 by Ken · edited Nov 13 '18 at 12:18 by Suraj Rao
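One caveat with the snippet above: on Python 3, base64.b64encode takes bytes, not str, so concatenating its result onto 'Basic ' raises a TypeError. A small helper (my own sketch, with a placeholder credential string) that works on Python 3:

```python
import base64

def basic_auth_header(user_pass):
    # b64encode operates on bytes; decode the result back to str
    # so it can be used as a header value.
    token = base64.b64encode(user_pass.encode("ascii")).decode("ascii")
    return "Basic " + token

# In process_request you would then set:
# request.headers['Proxy-Authorization'] = basic_auth_header("<USER>:<PASS>")
print(basic_auth_header("username:password"))  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```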