Best way to read in part of a huge table into AWS Glue

I'm having some trouble loading a large table from my data lake (currently stored in Postgres) into AWS Glue. It is 4.3 billion rows.

In testing, I've found that the table is too large to be read in fully.

Here's how I'm loading the DynamicFrame:

large_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="TABLECTX")

Important Factors

  • I don't need the whole data frame! I'll ultimately filter based on a couple of attributes and join with smaller tables.

  • I've already tried using a push_down_predicate, but that requires the data to be stored in S3 with a specific folder layout, and unfortunately I don't get to choose the pre-existing format of this table.

  • I've also tried reading the table in and simply reorganizing it into the S3 folder layout that push_down_predicate needs, but the job ends with "exit code 1" after running for 5 hours.

Primary Question

How can I read in part of a table without using a pushdown predicate?

      python amazon-web-services apache-spark pyspark aws-glue

asked Nov 12 at 17:51 by Reid

2 Answers

You can also use plain Spark/PySpark code in Glue and take advantage of Spark's own read methods. Their documentation shows how to read from Redshift, or from any SQL database in general, through JDBC, including reading the result of a query, as in the following example:

# Read data from a query
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()

I have found that AWS Glue implements only a small fraction of Spark's functionality, so I recommend going with Spark/PySpark when you have something complex to work on.
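
To apply the same idea to the Postgres source in the question, here is a minimal sketch of a parallel JDBC read that pulls only a filtered subset of the table. It is only an illustration: the host, database, credentials, table and column names, and the partition bounds are all placeholders, and it assumes a numeric id column to partition on.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Push the filter into Postgres by wrapping it in a subquery, so only the
# matching rows ever leave the database. All names and bounds are placeholders.
subset_query = "(SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'foo') AS subset"

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/dbname") \
    .option("dbtable", subset_query) \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "0") \
    .option("upperBound", "4300000000") \
    .option("numPartitions", "100") \
    .load()

Because the WHERE clause lives inside the subquery, the filtering happens in Postgres rather than in Glue, and partitionColumn/lowerBound/upperBound/numPartitions split the read across many parallel connections instead of streaming 4.3 billion rows through one.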


answered Nov 14 at 16:17 by Aida Martinez

Unfortunately, predicate pushdown only works for data in S3, as you have already figured out.

1. Regarding the "exit code 1": is your data in S3 in raw CSV format? Can you try creating multi-part bzip2 or lz4 files instead? In that case the load will be shared by multiple workers.

2. How many DPUs have you allocated for the job? This article gives a nice overview of DPU capacity planning.

3. Alternatively, you can create a view in Postgres that exposes only the rows you need and use that as the source (a minimal sketch follows below).

Please let me know if that helped.
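
As a sketch of option 3, assuming a hypothetical view name and placeholder connection details: the view keeps the heavy filtering inside Postgres, and DynamicFrame.fromDF converts the result back into a DynamicFrame for the rest of the Glue job.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# The view would be created once in Postgres, for example:
#   CREATE VIEW filtered_big_table AS
#   SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'foo';

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the (much smaller) view over JDBC; URL, credentials and names are placeholders.
view_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/dbname") \
    .option("dbtable", "filtered_big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

# Convert back to a DynamicFrame so the rest of the Glue job can stay unchanged.
large_dynamic_frame = DynamicFrame.fromDF(view_df, glueContext, "large_dynamic_frame")

Without a partition column the view is read over a single connection, so for a very large result set it may still be worth combining this with the partitioned JDBC read sketched under the other answer.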

answered Nov 13 at 21:30 by Tanveer Uddin