How can I read from S3 in pyspark running in local mode?


























I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, and no standalone Spark installation either (thus no SPARK_HOME, HADOOP_HOME, etc.)



When I try this:



from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")


I get:



py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3


How can I read from S3 while running PySpark in local mode, without a complete Hadoop install locally?



FWIW - this works great when I execute it on an EMR node in non-local mode.



The following does not work (same error, although it does resolve and download the dependencies):



import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")


Same (bad) results with:



import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")









python apache-spark amazon-s3 pyspark

asked May 4 '18 at 22:36, edited May 6 '18 at 18:36 – Jared
  • Possible duplicate of How can I access S3/S3n from a local Hadoop 2.6 installation?
    – hi-zir
    May 4 '18 at 22:53










  • There is no local Hadoop installation in this case - just Spark installed in the virtualenv via pip.
    – Jared
    May 6 '18 at 18:02






  • Try to use s3a protocol: inputFile = sparkContext.textFile("s3a://somebucket/file.csv")
    – prudenko
    May 8 '18 at 10:44


















3 Answers
So Glennie's answer was close, but not what would work in your case. The key thing is to select the right version of the dependencies. If you look at the jars bundled in the virtual environment:

[screenshot of the hadoop-*-2.7.3 jars shipped with the pip-installed pyspark]

everything points to one version, 2.7.3, which is what you also need to use:



os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'


You should verify the version that your installation is using by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual env.
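For example, a minimal sketch of that check (it assumes the pip-installed pyspark layout described above and resolves the jars directory from the pyspark package itself):

import glob
import os

import pyspark

# The pip-installed pyspark package ships its jars in a "jars" directory
# next to the Python sources.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# The version in the hadoop-* jar names (e.g. 2.7.3) is the version of
# hadoop-aws to request via --packages.
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))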



After that, you can use s3a out of the box, or s3 by mapping it onto the same handler class:



# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())


And the output is below:

[screenshot of the Spark output]
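For reference, a minimal end-to-end sketch of this approach (the bucket name and credentials are placeholders, and hadoop-aws:2.7.3 assumes the bundled Hadoop 2.7.3 jars shown above):

import os

# Pull in hadoop-aws matching the Hadoop version bundled with pyspark (2.7.3 here).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)

# Map s3:// onto the S3A filesystem and supply credentials (placeholders).
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "awsKey")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "awsSecret")

inputFile = sc.textFile("s3a://somebucket/file.csv")
print(inputFile.count())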






answered May 8 '18 at 21:16, edited May 8 '18 at 21:21 – Tarun Lalwani























  • What's the hierarchy in the Project pane to get to that "hadoop-common-2.7.3"? (So that I can check and make sure I have the same one).
    – Jared
    May 8 '18 at 21:18










  • On my laptop the relative path in environment is venv/Lib/site-packages/pyspark/jars
    – Tarun Lalwani
    May 8 '18 at 21:19










  • @Jared, worked?
    – Tarun Lalwani
    May 8 '18 at 21:38










  • Very close. I'm now getting "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain." We don't support permanent awsKey/awsSecret pairs, so I need to figure out how to get this to work with either a sessionToken or com.amazonaws.auth.profile.ProfileCredentialsProvider, where I'll have a session token in my creds file. Any hints? Or want me to open another question on that one?
    – Jared
    May 8 '18 at 21:44










  • I think that is a new problem, so it is best sorted out in a new question. Post the link here and I will try to help.
    – Tarun Lalwani
    May 8 '18 at 21:45

































You should use the s3a protocol when accessing S3 locally. Make sure you add your key and secret to the SparkContext first. Like this:



sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")
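If you prefer the SparkSession API, a sketch of the equivalent setup (placeholder credentials; spark.hadoop.* properties are forwarded to the underlying Hadoop configuration, and the hadoop-aws package from the question's --packages is still required on the classpath):

from pyspark.sql import SparkSession

# Same s3a settings expressed as spark.hadoop.* config entries.
spark = SparkSession.builder \
    .master("local") \
    .appName("pyspark-unittests") \
    .config("spark.hadoop.fs.s3a.access.key", "awsKey") \
    .config("spark.hadoop.fs.s3a.secret.key", "awsSecret") \
    .getOrCreate()

df = spark.read.csv("s3a://somebucket/file.csv")
df.show()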





answered May 8 '18 at 17:54 – Glennie Helles Sindholt





















  • With the --packages in the question, this gets closer..... I now get "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities". Seems like I'm still missing some dependencies.
    – Jared
    May 8 '18 at 21:04












  • If I also add "org.apache.hadoop:hadoop-common:3.1.0" to the packages, I get "java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()"....so probably just need to find the correct version of hadoop-common (I think)
    – Jared
    May 8 '18 at 21:13

































preparation:



Add the following lines to your Spark config file; for my local PySpark installation it is /usr/local/spark/conf/spark-defaults.conf:



spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>


python file content:



from __future__ import print_function
import os

from pyspark import SparkConf
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"


if __name__ == "__main__":

    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())


submit:



spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-common:2.7.3 \
    <path to the py file above>
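As a variation (a sketch, not part of the original answer), the same spark.hadoop.fs.s3a.* keys can be set programmatically on the SparkConf instead of in spark-defaults.conf; the jars still have to be supplied via --packages as above:

from pyspark import SparkConf
from pyspark import SparkContext

# Placeholder credentials; spark.hadoop.* entries are copied into the
# Hadoop configuration used by the s3a filesystem.
conf = SparkConf() \
    .setAppName("read_s3") \
    .setMaster("local[2]") \
    .set("spark.hadoop.fs.s3a.access.key", "<your access key>") \
    .set("spark.hadoop.fs.s3a.secret.key", "<your secret key>")

sc = SparkContext(conf=conf)
print("file count:", sc.textFile("s3a://store-test-1/test-file").count())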





answered Nov 13 '18 at 1:52 – buxizhizhoum




















