Incremental upload/update to PostgreSQL table using Pentaho DI











I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.



The first time I run this flow everything goes OK and the table gets populated. However, if I later want to update the same table incrementally, I have to truncate it and run the flow again. Is there any method that lets me load only new/updated rows?



In the PostgreSQL Bulk Loader step I can only see "Truncate/Insert" options, and this is very inefficient because my tables are really large.



See my implementation:



[screenshot of the PDI transformation]



Thanks in advance!!










Tags: postgresql, pentaho-data-integration, incremental-build






asked Nov 11 at 18:20 by Gabriela Martinez
























2 Answers

















Accepted answer (score 1), answered Nov 13 at 18:02 by Cristian Curti










Looking around for possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions), but there are ways of countering this.



Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and change the number of copies executing the step (this depends on how many cores your processor has) to, say, 4 copies (a dual-core CPU with 2 logical cores each). You can change the number of copies by right-clicking the step in the GUI and setting the desired number.



This will parallelize the output into 4 groups of inserts, with 5000 rows per 'cycle' each. If this causes a memory overload in the JVM, you can easily increase the memory available through the PENTAHO_DI_JAVA_OPTIONS option: simply double the amounts set for Xms (minimum) and Xmx (maximum). Mine is set to "-Xms2048m" "-Xmx4096m".
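For intuition, a commit size of 5000 just means the rows are written inside transactions that are committed every 5000 rows, roughly like the sketch below. This is only an approximation of what the step sends (table and column names are made up for illustration):

    -- Rough sketch only; Pentaho generates the actual statements.
    BEGIN;
    INSERT INTO target_table (id, name, amount) VALUES (1, 'a', 10.0);
    INSERT INTO target_table (id, name, amount) VALUES (2, 'b', 20.0);
    -- ... up to 5000 rows per commit cycle ...
    COMMIT;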



The only peculiarity I found with this step and PostgreSQL is that you need to specify the Database fields in the step, even if the incoming rows have exactly the same layout as the table.






• I think the requirement is for an incremental load, not a batch load. Please find my answer; do you think it is the right approach?
– KP M, Nov 14 at 15:52










• Ah yes, I missed that. For this, the 'Synchronize after merge' step is better; I'll edit my answer as soon as I can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well; it's just that Synchronize after merge handles inserts, updates, and deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti, Nov 14 at 17:56


















Answer (score 0), answered Nov 14 at 15:51 by KP M (edited Nov 14 at 15:59)













You are looking for an incremental load. You can do it in two ways; rough SQL sketches of both follow after the list below.




1. There is a step called "Insert/Update"; it can be used to do an incremental load.


You will have the option to specify the key columns to compare. Then, under the fields section, select "Y" for update on the columns you want updated, and select "N" for the columns you are using in the key comparison.




2. Use Table output and uncheck the "Truncate table" option. While retrieving the data from the source table, use a variable in the where clause: first get the max value from your target table, set that value into a variable, and include it in the where clause of your query.


Edit: if your data source is a flat file, then, as mentioned above, get the max value (date/int) from the target table and join it with your data; after that, use a Filter rows step to keep only the incremental rows.
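To make the two options concrete, here are rough SQL sketches. All table, column, and variable names are hypothetical placeholders, not taken from the question. Option 1 (Insert/Update keyed on, say, an id column) has roughly the same effect as a PostgreSQL upsert, although the step itself does lookups plus separate inserts/updates rather than issuing this statement:

    -- Approximation of what Insert/Update achieves, keyed on "id"
    -- (an upsert like this needs a unique constraint on the key column):
    INSERT INTO target_table (id, name, amount)
    VALUES (1, 'a', 10.0)
    ON CONFLICT (id)
    DO UPDATE SET name   = EXCLUDED.name,
                  amount = EXCLUDED.amount;

Option 2 is the usual high-water-mark pattern: read the maximum loaded value from the target, then keep only newer rows from the source (through the where clause of a Table input step, or a Filter rows step when the source is a flat file):

    -- 1) Current high-water mark in the target table:
    SELECT COALESCE(MAX(last_updated), DATE '1900-01-01') AS max_loaded
    FROM target_table;

    -- 2) Pull only rows newer than that value; '${MAX_LOADED}' stands for a
    --    PDI variable holding the result of step 1 as text:
    SELECT *
    FROM source_table
    WHERE last_updated > '${MAX_LOADED}';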



          Hope this will help.





