Best way to read in part of a huge table into AWS Glue
I'm having some trouble loading a large table from my data lake (currently stored in Postgres) into AWS Glue. It is 4.3 billion rows.
In testing, I've found that the table is too large to be read in fully.
Here's how I'm loading the data frame:
large_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="TABLECTX")
Important Factors
I don't need the whole data frame! I'll ultimately filter based on a couple of attributes and join with smaller tables.
I've already tried using a push_down_predicate, but that requires the data to be stored in S3 with a specific partition-folder layout, and unfortunately I don't get to choose the pre-existing format of this table. I've also tried reading in the table and simply re-organizing it into the S3 folder layout needed for push_down_predicate to work, but that job ends with "exit code 1" after 5 hours of running.
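For reference, the predicate pushdown attempt looked roughly like the sketch below; it only helps when the catalog table is partitioned in S3, and the partition column ("year") here is just a placeholder:
# Hypothetical sketch: only works if TABLENAME is partitioned in S3, e.g. by "year"
partial_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    push_down_predicate="year = '2018'",
    transformation_ctx="TABLECTX_filtered")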
Primary Question
How can I read in part of a table without using a pushdown predicate?
python amazon-web-services apache-spark pyspark aws-glue
asked Nov 12 at 17:51
Reid
2 Answers
You can also use plain Spark/PySpark code in Glue and take advantage of its read methods.
Their documentation shows how to read from Redshift, or in general from any SQL database through JDBC, and even how to read the result of a query, as in the following example:
# Read the result of a query instead of a whole table
df = (spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://path/for/temp/data")
    .load())
I have found that AWS Glue implements only a small fraction of Spark's functionality, so I recommend going with plain Spark/PySpark when you have something complex to work on.
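Since your source is Postgres rather than Redshift, a similar approach works with Spark's built-in JDBC reader. Below is a minimal sketch, assuming the Postgres JDBC driver is available to the job; the host, credentials, table, filter, and the numeric column used for partitioning are placeholders you would replace:
# Hypothetical sketch: push the filter into Postgres and read the result in parallel.
# Host, credentials, table, column names and bounds below are placeholders.
filtered_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://pghost:5432/database")
    .option("user", "username")
    .option("password", "pass")
    .option("dbtable", "(select id, attr_a, attr_b from big_table where attr_a = 'x') as t")
    .option("partitionColumn", "id")     # numeric column used to split the read
    .option("lowerBound", "1")
    .option("upperBound", "4300000000")
    .option("numPartitions", "100")
    .load())

# Convert back to a DynamicFrame if the rest of the job expects one
from awsglue.dynamicframe import DynamicFrame
filtered_dyf = DynamicFrame.fromDF(filtered_df, glueContext, "filtered_dyf")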
answered Nov 14 at 16:17
Aida Martinez
Unfortunately, predicate pushdown works only for S3, as you've already figured out.
- Regarding the "exit code 1": is your data in S3 in raw CSV format? Can you try creating multi-part bzip2 or lz4 files instead? That way the load will be shared by multiple workers.
- How many DPUs have you allocated for the job? This article gives a nice overview of DPU capacity planning.
- Or you can create a view in Postgres that contains only the rows you need and use that as the source (see the sketch below).
Please let me know if that helped.
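A minimal sketch of the view idea, with hypothetical table and column names; the view pre-filters the rows on the Postgres side so Glue only ever sees the subset:
# Hypothetical sketch. First, on the Postgres side, create a view that keeps only
# the rows/columns the job needs, e.g.:
#   CREATE VIEW filtered_big_table AS
#   SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'x';
# Then, assuming the view has been crawled/cataloged like the original table,
# the existing call can simply target the view:
small_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="filtered_big_table",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="FILTEREDCTX")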
answered Nov 13 at 21:30
Tanveer Uddin