Best way to read in part of a huge table into AWS Glue

I'm having some trouble loading a large table from my data lake (currently stored in Postgres) into AWS Glue. It is 4.3 billion rows.

In testing, I've found that the table is too large to be read in fully.

Here's how I'm loading the DynamicFrame:

large_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="TABLECTX")

Important Factors

  • I don't need the whole data frame! I'll ultimately filter based on a couple of attributes and join with smaller tables.

  • I've already tried using a push_down_predicate, but that requires the data to be stored in S3 with a specific folder layout, and unfortunately I don't get to choose the pre-existing format of this table.

  • I've also tried reading the table in and simply reorganizing it into the S3 folder layout that push_down_predicate needs, but the job ends with "exit code 1" after running for 5 hours.

Primary Question

How can I read in part of a table without using a pushdown predicate?

      python amazon-web-services apache-spark pyspark aws-glue

asked Nov 12 at 17:51 by Reid

2 Answers

You can also use plain Spark/PySpark code in Glue and take advantage of Spark's own read methods. Their documentation shows how to read from Redshift, or from any SQL database in general, through JDBC, including reading the result of a query, as in the following example:

# Read data from a query
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()

I have found that AWS Glue implements only a small fraction of Spark's functionality, so I recommend going with Spark/PySpark when you have something complex to work on.
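
To apply the same idea to the Postgres source in the question, here is a minimal sketch of a parallel JDBC read that pulls only a filtered subset of the table. It is only an illustration: the host, database, credentials, table and column names, and the partition bounds are all placeholders, and it assumes a numeric id column to partition on.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Push the filter into Postgres by wrapping it in a subquery, so only the
# matching rows ever leave the database. All names and bounds are placeholders.
subset_query = "(SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'foo') AS subset"

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/dbname") \
    .option("dbtable", subset_query) \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "0") \
    .option("upperBound", "4300000000") \
    .option("numPartitions", "100") \
    .load()

Because the WHERE clause lives inside the subquery, the filtering happens in Postgres rather than in Glue, and partitionColumn/lowerBound/upperBound/numPartitions split the read across many parallel connections instead of streaming 4.3 billion rows through one.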


answered Nov 14 at 16:17 by Aida Martinez

Unfortunately, predicate pushdown only works for data in S3, as you have already figured out.

1. Regarding the "exit code 1": is your data in S3 in raw CSV format? Can you try creating multi-part bzip2 or lz4 files instead? In that case the load will be shared by multiple workers.

2. How many DPUs have you allocated for the job? This article gives a nice overview of DPU capacity planning.

3. Alternatively, you can create a view in Postgres that exposes only the rows you need and use that as the source (a minimal sketch follows below).

Please let me know if that helped.
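
As a sketch of option 3, assuming a hypothetical view name and placeholder connection details: the view keeps the heavy filtering inside Postgres, and DynamicFrame.fromDF converts the result back into a DynamicFrame for the rest of the Glue job.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# The view would be created once in Postgres, for example:
#   CREATE VIEW filtered_big_table AS
#   SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'foo';

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the (much smaller) view over JDBC; URL, credentials and names are placeholders.
view_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/dbname") \
    .option("dbtable", "filtered_big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

# Convert back to a DynamicFrame so the rest of the Glue job can stay unchanged.
large_dynamic_frame = DynamicFrame.fromDF(view_df, glueContext, "large_dynamic_frame")

Without a partition column the view is read over a single connection, so for a very large result set it may still be worth combining this with the partitioned JDBC read sketched under the other answer.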

answered Nov 13 at 21:30 by Tanveer Uddin