How can I read from S3 in pyspark running in local mode?


























I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, and no standalone Spark installation either (thus no SPARK_HOME, HADOOP_HOME, etc.)



When I try this:



from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")


I get:



py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3


How can I read from S3 while running PySpark in local mode, without a complete Hadoop install locally?



FWIW - this works great when I execute it on an EMR node in non-local mode.



The following does not work (same error, although it does resolve and download the dependencies):



import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")


Same (bad) results with:



import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
inputFile = sc.textFile("s3://somebucket/file.csv")









python apache-spark amazon-s3 pyspark

asked May 4 '18 at 22:36, edited May 6 '18 at 18:36 – Jared
  • Possible duplicate of How can I access S3/S3n from a local Hadoop 2.6 installation?
    – hi-zir
    May 4 '18 at 22:53










  • There is no local Hadoop installation in this case - just Spark installed in the virtualenv via pip.
    – Jared
    May 6 '18 at 18:02






  • Try to use s3a protocol: inputFile = sparkContext.textFile("s3a://somebucket/file.csv")
    – prudenko
    May 8 '18 at 10:44


















3 Answers
So Glennie's answer was close, but not what would work in your case. The key thing is to select the right version of the dependencies. If you look at the jars bundled in the virtual environment:

[screenshot of the hadoop-*-2.7.3 jars shipped with the pip-installed pyspark]

everything points to one version, 2.7.3, which is what you also need to use:



os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'


You should verify the version that your installation is using by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual env.
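For example, a minimal sketch of that check (it assumes the pip-installed pyspark layout described above and resolves the jars directory from the pyspark package itself):

import glob
import os

import pyspark

# The pip-installed pyspark package ships its jars in a "jars" directory
# next to the Python sources.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# The version in the hadoop-* jar names (e.g. 2.7.3) is the version of
# hadoop-aws to request via --packages.
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))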



After that, you can use s3a out of the box, or s3 by mapping it onto the same handler class:



# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())


And the output is below:

[screenshot of the Spark output]
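For reference, a minimal end-to-end sketch of this approach (the bucket name and credentials are placeholders, and hadoop-aws:2.7.3 assumes the bundled Hadoop 2.7.3 jars shown above):

import os

# Pull in hadoop-aws matching the Hadoop version bundled with pyspark (2.7.3 here).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)

# Map s3:// onto the S3A filesystem and supply credentials (placeholders).
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "awsKey")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "awsSecret")

inputFile = sc.textFile("s3a://somebucket/file.csv")
print(inputFile.count())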






answered May 8 '18 at 21:16, edited May 8 '18 at 21:21 – Tarun Lalwani























  • What's the hierarchy in the Project pane to get to that "hadoop-common-2.7.3"? (So that I can check and make sure I have the same one).
    – Jared
    May 8 '18 at 21:18










  • On my laptop the relative path in environment is venv/Lib/site-packages/pyspark/jars
    – Tarun Lalwani
    May 8 '18 at 21:19










  • @Jared, worked?
    – Tarun Lalwani
    May 8 '18 at 21:38










  • Very close. I'm now getting "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain." We don't support permanent awsKey/awsSecret pairs, so I need to figure out how to get this to work with either a sessionToken or com.amazonaws.auth.profile.ProfileCredentialsProvider, where I'll have a session token in my creds file. Any hints? Or want me to open another question on that one?
    – Jared
    May 8 '18 at 21:44










  • I think that is a new problem, so it is best sorted out in a new question. Post the link here and I will try to help.
    – Tarun Lalwani
    May 8 '18 at 21:45

































You should use the s3a protocol when accessing S3 locally. Make sure you add your key and secret to the SparkContext first. Like this:



sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")
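If you prefer the SparkSession API, a sketch of the equivalent setup (placeholder credentials; spark.hadoop.* properties are forwarded to the underlying Hadoop configuration, and the hadoop-aws package from the question's --packages is still required on the classpath):

from pyspark.sql import SparkSession

# Same s3a settings expressed as spark.hadoop.* config entries.
spark = SparkSession.builder \
    .master("local") \
    .appName("pyspark-unittests") \
    .config("spark.hadoop.fs.s3a.access.key", "awsKey") \
    .config("spark.hadoop.fs.s3a.secret.key", "awsSecret") \
    .getOrCreate()

df = spark.read.csv("s3a://somebucket/file.csv")
df.show()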





answered May 8 '18 at 17:54 – Glennie Helles Sindholt





















  • With the --packages in the question, this gets closer..... I now get "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities". Seems like I'm still missing some dependencies.
    – Jared
    May 8 '18 at 21:04












  • If I also add "org.apache.hadoop:hadoop-common:3.1.0" to the packages, I get "java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()"....so probably just need to find the correct version of hadoop-common (I think)
    – Jared
    May 8 '18 at 21:13

































preparation:



Add the following lines to your Spark config file; for my local PySpark installation it is /usr/local/spark/conf/spark-defaults.conf:



spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>


python file content:



from __future__ import print_function
import os

from pyspark import SparkConf
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"


if __name__ == "__main__":

    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())


submit:



spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-common:2.7.3 \
    <path to the py file above>
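As a variation (a sketch, not part of the original answer), the same spark.hadoop.fs.s3a.* keys can be set programmatically on the SparkConf instead of in spark-defaults.conf; the jars still have to be supplied via --packages as above:

from pyspark import SparkConf
from pyspark import SparkContext

# Placeholder credentials; spark.hadoop.* entries are copied into the
# Hadoop configuration used by the s3a filesystem.
conf = SparkConf() \
    .setAppName("read_s3") \
    .setMaster("local[2]") \
    .set("spark.hadoop.fs.s3a.access.key", "<your access key>") \
    .set("spark.hadoop.fs.s3a.secret.key", "<your secret key>")

sc = SparkContext(conf=conf)
print("file count:", sc.textFile("s3a://store-test-1/test-file").count())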





answered Nov 13 '18 at 1:52 – buxizhizhoum




















