Is it possible to access https pages through a proxy with Scrapy?

I can successfully access http pages through a proxy in Scrapy, but I cannot access https sites. I've researched the topic, but it's still unclear to me. Is it possible to access https pages through a proxy with Scrapy? Do I need to patch anything? Or add some custom code? If it can be confirmed this is a standard feature, I can follow up with more details. Hopefully this is an easy one.

Edited:

Here is what I added to the settings file:

DOWNLOADER_MIDDLEWARES = {'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110, 'test_website.middlewares.ProxyMiddleware': 100}

PROXIES = [{'ip_port': 'us-il.proxymesh.com:31280', 'user_pass': 'username:password'}]

Here is the code for my spider:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test_spider"
    allowed_domains = ["ipify.org"]
    start_urls = ["https://api.ipify.org"]

    def parse(self, response):
        # Save the raw response body so the reported IP can be inspected
        with open('test.html', 'wb') as f:
            f.write(response.body)

Here is the middlewares file:

import base64
import random
from settings import PROXIES


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']

Here is the log file:

2015-08-12 20:15:50 [scrapy] INFO: Scrapy 1.0.3 started (bot: test_website)
2015-08-12 20:15:50 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 20:15:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test_website.spiders', 'SPIDER_MODULES': ['test_website.spiders'], 'LOG_STDOUT': True, 'LOG_FILE': 'log.txt', 'BOT_NAME': 'test_website'}
2015-08-12 20:15:51 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-12 20:15:53 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 20:15:53 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 20:15:53 [scrapy] INFO: Enabled item pipelines:
2015-08-12 20:15:53 [scrapy] INFO: Spider opened
2015-08-12 20:15:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-12 20:15:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Gave up retrying <GET https://api.ipify.org> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] ERROR: Error downloading <GET https://api.ipify.org>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] INFO: Closing spider (finished)
2015-08-12 20:15:53 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 819,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 943000),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 38000)}
2015-08-12 20:15:53 [scrapy] INFO: Spider closed (finished)

My spider works if I remove the 's' from 'https' or disable the proxy. I can also reach the https page through the same proxy from my browser.

edited Aug 13 '15 at 2:16

asked Aug 12 '15 at 19:37

patrick.s


python https scrapy proxies


3 Answers

I got this error because I was using base64.encodestring instead of base64.b64encode in the proxy middleware. Try changing it.
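The difference between the two functions can be seen in a minimal sketch. It assumes Python 3, where the old base64.encodestring is available as base64.encodebytes, and it reuses the placeholder credentials from the question. encodebytes appends a trailing newline (and wraps long output), which ends up inside the Proxy-Authorization header value, while b64encode returns the bare token.

```python
import base64

# Placeholder credentials, as in the question's PROXIES setting
credentials = b"username:password"

# encodebytes is the Python 3 name for the old encodestring; it appends "\n"
# and breaks long output into 76-character lines
with_newline = base64.encodebytes(credentials)

# b64encode returns the encoded bytes with no line breaks at all
clean = base64.b64encode(credentials)

print(repr(with_newline))
print(repr(clean))
```

The only difference for short credentials is the trailing newline, which is enough to corrupt the header.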

answered Sep 14 '16 at 1:32

Aminah Nuraini



I think it's possible.

If you're setting the proxy through Request.meta it should just work.
If you're setting the proxy with the http_proxy environment variable, you might also need to set https_proxy.
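For the environment-variable route, a quick check is to see what Python's own proxy lookup returns. This is a sketch under the assumption that Scrapy's built-in HttpProxyMiddleware reads its defaults through urllib's getproxies(); the proxy URL is the one from the question, and setting the variables from inside the script is only for illustration.

```python
import os
from urllib.request import getproxies

# Proxy host taken from the question; credentials omitted on purpose
proxy_url = "http://us-il.proxymesh.com:31280"

# Both variables are needed: http_proxy covers http:// URLs only,
# https_proxy covers https:// URLs
os.environ["http_proxy"] = proxy_url
os.environ["https_proxy"] = proxy_url

# getproxies() maps each URL scheme to the proxy configured for it
proxies = getproxies()
print(proxies.get("http"), proxies.get("https"))
```

If the "https" entry is missing, https requests would not be sent through the proxy at all.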

It might be the case, however, that your proxy does not support HTTPS.

It would be easier to help you if you posted the error you are getting.

edited Aug 12 '15 at 23:54

answered Aug 12 '15 at 22:46

Artur Gaspar



Scrapy handles HTTPS through the proxy automatically.
As @Aminah Nuraini said, just use base64.b64encode instead of base64.encodestring in the proxy middleware.

Add the following code to middlewares.py:

import base64


class ProxyMiddleware(object):
    # Override process_request to route every request through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy, e.g. "http://host:port"
        request.meta['proxy'] = "<PROXY_SERVER>:<PROXY_PORT>"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "<PROXY_USERNAME>:<PROXY_PASSWORD>"

        # Set up basic authentication for the proxy; b64encode needs bytes
        # and, unlike encodestring, adds no trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8'))
        request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    '<YOUR_PROJECT>.middlewares.ProxyMiddleware': 100,
}
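To combine this with the PROXIES list from the question, the proxy URL and header value can be built in one place. The helper below is a hypothetical sketch, not part of Scrapy: the name proxy_settings and its return shape are assumptions, and it only relies on the 'ip_port' and 'user_pass' keys used in the question.

```python
import base64


def proxy_settings(proxy):
    """Return (proxy_url, auth_header) for one PROXIES entry.

    auth_header is None when the entry carries no credentials.
    """
    proxy_url = "http://%s" % proxy["ip_port"]
    user_pass = proxy.get("user_pass")
    if user_pass is None:
        return proxy_url, None
    # b64encode works on bytes and adds no trailing newline
    token = base64.b64encode(user_pass.encode("utf-8"))
    return proxy_url, b"Basic " + token


url, header = proxy_settings(
    {"ip_port": "us-il.proxymesh.com:31280", "user_pass": "username:password"}
)
print(url)
print(header)
```

Inside process_request, the first value would go into request.meta['proxy'] and the second, when not None, into request.headers['Proxy-Authorization'].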

edited Nov 13 '18 at 12:18

Suraj Rao


answered Nov 13 '18 at 12:13

Ken

