How can I read from S3 in pyspark running in local mode?
I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3, installed via pip in a virtualenv. There is no Hadoop installation on the local host and no standalone Spark installation (thus no SPARK_HOME, HADOOP_HOME, etc.).
When I try this:
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")
I get:
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3
How can I read from s3 while running pyspark in local mode without a complete Hadoop install locally?
FWIW - this works great when I execute it on an EMR node in non-local mode.
The following does not work (same error, although it does resolve and download the dependencies):
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")
Same (bad) results with:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")
Tags: python, apache-spark, amazon-s3, pyspark

asked May 4 '18 at 22:36, edited May 6 '18 at 18:36 – Jared
Possible duplicate of How can I access S3/S3n from a local Hadoop 2.6 installation?
– hi-zir
May 4 '18 at 22:53
There is no local Hadoop installation in this case - just Spark installed in the virtualenv via pip.
– Jared
May 6 '18 at 18:02
Try to use the s3a protocol: inputFile = sparkContext.textFile("s3a://somebucket/file.csv")
– prudenko
May 8 '18 at 10:44
3 Answers
So Glennie's answer was close, but it isn't what will work in your case. The key thing is to select the right version of the dependencies. If you look at the jars bundled inside the virtual environment (the original answer showed a screenshot of the PyCharm project pane), everything points to one version, 2.7.3, which is the version you also need to use:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
You can verify which version your installation is using by checking the jars under venv/Lib/site-packages/pyspark/jars inside your project's virtual env.
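For example, a quick way to check it programmatically (just a sketch; it assumes a standard pip install of PySpark, where the bundled jars sit next to the package):

import glob
import os
import pyspark

# The jars directory that ships with the pip-installed PySpark
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
# The hadoop-common jar name reveals the bundled Hadoop version,
# e.g. hadoop-common-2.7.3.jar -> use org.apache.hadoop:hadoop-aws:2.7.3
print(glob.glob(os.path.join(jars_dir, "hadoop-common-*.jar")))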
After that you can use s3a:// directly, or s3:// by mapping it to the same S3A filesystem implementation:
# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
The output (screenshot omitted here) shows the file's line count and the RDD id.

answered May 8 '18 at 21:16, edited May 8 '18 at 21:21 – Tarun Lalwani
What's the hierarchy in the Project pane to get to that "hadoop-common-2.7.3"? (So that I can check and make sure I have the same one).
– Jared
May 8 '18 at 21:18
On my laptop the relative path in the environment is venv/Lib/site-packages/pyspark/jars
– Tarun Lalwani
May 8 '18 at 21:19
@Jared, worked?
– Tarun Lalwani
May 8 '18 at 21:38
Very close. I'm now getting "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain." We don't support permanent awsKey/awsSecret pairs, so I need to figure out how to get this to work with either a sessionToken or com.amazonaws.auth.profile.ProfileCredentialsProvider, where I'll have a session token in my creds file. Any hints? Or want me to open another question on that one?
– Jared
May 8 '18 at 21:44
I think that is a new problem, so it's best sorted out in a new question. Post the link here and I will try to help.
– Tarun Lalwani
May 8 '18 at 21:45
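For reference, a minimal sketch of the session-token approach discussed above, using S3A's temporary-credentials provider. It assumes a hadoop-aws version that includes org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider (2.8.0 or later); the key and token values are placeholders:

hadoop_conf = sc._jsc.hadoopConfiguration()
# Switch S3A to the provider that accepts a session token alongside the key pair
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "awsKey")
hadoop_conf.set("fs.s3a.secret.key", "awsSecret")
hadoop_conf.set("fs.s3a.session.token", "awsSessionToken")
inputFile = sc.textFile("s3a://somebucket/file.csv")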
You should use the s3a protocol when accessing S3 locally. Make sure you add your key and secret to the SparkContext first, like this:
sc = SparkContext(conf = conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
inputFile = sparkContext.textFile("s3a://somebucket/file.csv")
With the --packages in the question, this gets closer..... I now get "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities". Seems like I'm still missing some dependencies.
– Jared
May 8 '18 at 21:04
If I also add "org.apache.hadoop:hadoop-common:3.1.0" to the packages, I get "java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()"....so probably just need to find the correct version of hadoop-common (I think)
– Jared
May 8 '18 at 21:13
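Putting this answer together with the version note from the answer above, a minimal end-to-end sketch for local mode; it assumes the pip-installed PySpark bundles Hadoop 2.7.3 and that a plain access key/secret pair is acceptable, and the bucket and credentials are placeholders:

import os
# hadoop-aws must match the Hadoop version bundled with the pip-installed PySpark (2.7.3 here)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf().setMaster("local").setAppName("pyspark-unittests")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
inputFile = sc.textFile("s3a://somebucket/file.csv")
print(inputFile.count())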
Preparation:
Add the following lines to your Spark config file; for my local PySpark it is /usr/local/spark/conf/spark-defaults.conf:
spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>
Python file content:
from __future__ import print_function
import os
from pyspark import SparkConf
from pyspark import SparkContext
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
if __name__ == "__main__":
    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)
    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())
Submit:
spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-common:2.7.3 \
    <path to the py file above>

answered Nov 13 '18 at 1:52 – buxizhizhoum