Incremental upload/update to PostgreSQL table using Pentaho DI
I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.
The first time I run this flow everything goes OK and the table gets populated. However, if I later want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to load only new/updated rows?
In the PostgreSQL Bulk Load operator I can only see "Truncate/Insert" options, and this is very inefficient, as my tables are really large.
See my implementation:
Thanks in advance!!
postgresql pentaho-data-integration incremental-build
asked Nov 11 at 18:20
Gabriela Martinez
2 Answers
Accepted answer (score 1), answered Nov 13 at 18:02 by Cristian Curti
Looking around for possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions), but there are ways of countering this.
Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and increase the number of copies executing the step to, say, 4, depending on how many cores your processor has (e.g. a dual-core CPU with 2 logical cores each). You can change the number of copies by right-clicking the step in the GUI and setting the desired number.
This will parallelize the output into 4 groups of inserts, of 5000 rows per commit each. If this causes memory overload in the JVM, you can increase the memory available to it in the PENTAHO_DI_JAVA_OPTIONS option: simply double the amounts set for Xms (minimum) and Xmx (maximum); mine is set to "-Xms2048m" "-Xmx4096m".
The only peculiarity I found with this step and PostgreSQL is that you need to specify the Database fields in the step, even if the incoming rows have exactly the same layout as the table.
I think the requirement is for an incremental load, not a batch load. Please see my answer. Do you think it is the right approach?
– KP M
Nov 14 at 15:52
Ah yes, I missed that. For this the 'Synchronize after merge' step is better; I'll edit my answer as soon as I can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well; it's just that Synchronize after merge also handles inserts, updates and deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56
Answer (score 0), answered Nov 14 at 15:51 by KP M, edited Nov 14 at 15:59
You are looking for an incremental load. You can do it in two ways:
- There is a step called "Insert/Update" that can be used for incremental loads. It lets you specify key columns to compare; then, under the Fields section, select "Y" for Update on the columns that should be updated, and "N" for the columns you use in the key comparison. (A rough SQL equivalent of what this step does is sketched at the end of this answer.)
- Use Table output and uncheck the "Truncate table" option. When retrieving the data from the source table, use a variable in the WHERE clause: first get the maximum value from your target table, set it into a variable, and include that variable in the WHERE clause of your query, as in the sketch below.
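A minimal SQL sketch of the second option, assuming hypothetical source_table and target_table names and a last_updated column (your actual schema will differ). The first query would run in a step that sets a variable (here called MAX_LOADED), and the second in a Table input step with variable substitution enabled:

-- 1) High-water mark: the most recent value already loaded into the target
SELECT COALESCE(MAX(last_updated), '1900-01-01') AS max_loaded
FROM target_table;

-- 2) Pull only rows newer than that value from the source;
--    ${MAX_LOADED} is substituted by PDI at run time
SELECT *
FROM source_table
WHERE last_updated > '${MAX_LOADED}';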
Edit: if your data source is a flat file, then, as described above, get the maximum value (a date or integer) from the target table and join it against your data; after that, use a Filter rows step to keep only the incremental rows.
Hope this will help.
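For reference, the Insert/Update step in the first option behaves much like a PostgreSQL upsert. A minimal sketch, assuming a hypothetical target_table with a unique id key; the step itself does the lookup and insert/update row by row, but the net effect on the table is similar:

-- Requires a unique or primary key constraint on the compared key column(s)
INSERT INTO target_table (id, value, last_updated)
VALUES (42, 'new value', now())
ON CONFLICT (id)
DO UPDATE SET value = EXCLUDED.value,
              last_updated = EXCLUDED.last_updated;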