How to handle backpressure on databases when using Apache Spark?





We are using Apache Spark to perform ETL every 2 hours.

Sometimes Spark puts a lot of pressure on our databases when read/write operations are performed.

For Spark Streaming, I can see a backpressure configuration for Kafka.

Is there a way to handle this issue in batch processing?











apache-spark apache-spark-sql

edited Nov 16 '18 at 13:56
eliasah

asked Nov 16 '18 at 12:29
Gowthaman V

1 Answer

Backpressure is really just a fancy word for setting up a maximum receiving rate, so it doesn't work the way you think it does.

What should be done here is actually on the reading end.

In classical JDBC usage, JDBC connectors have a fetchSize property for PreparedStatements, so you can consider configuring that fetchSize based on what is said in the following answers (a short sketch follows the list):




• Spark JDBC fetchsize option
• What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
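For concreteness, here is a minimal sketch of what tuning the fetch size looks like with Spark's JDBC reader. The connection URL, table, and credentials are hypothetical placeholders; "fetchsize" is the Spark JDBC option that maps to the driver's fetchSize.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("etl").getOrCreate()

    // "fetchsize" controls how many rows the JDBC driver pulls per round trip:
    // larger values mean fewer round trips but more executor memory.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // hypothetical URL
      .option("dbtable", "public.events")                   // hypothetical table
      .option("user", "etl_user")                           // hypothetical user
      .option("password", sys.env("DB_PASSWORD"))
      .option("fetchsize", "10000")                         // rows per round trip
      .load()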



Unfortunately, this might not solve all of your performance issues with your RDBMS.

What you must know is that, compared to the basic JDBC reader, which runs on a single worker, partitioning the data using an integer column or a sequence of predicates loads it in a distributed mode, but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
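To make that trade-off concrete, here is a sketch of a partitioned read where numPartitions caps how many concurrent connections Spark opens against the database. The partition column and its bounds are assumptions for the example.

    // Partitioned read: Spark issues numPartitions parallel queries, so
    // numPartitions is effectively a cap on concurrent DB connections.
    val partitioned = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.events")
      .option("partitionColumn", "id")  // must be numeric, date, or timestamp
      .option("lowerBound", "1")
      .option("upperBound", "10000000") // assumed range of the id column
      .option("numPartitions", "4")     // at most 4 parallel reads hit the DB
      .option("user", "etl_user")
      .option("password", sys.env("DB_PASSWORD"))
      .load()

Lowering numPartitions directly reduces the read pressure on the database, at the cost of a slower load.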



To deal with this, I suggest the following (the write side can be throttled in a similar way; see the sketch after this list):

• If available, consider using specialized data sources over JDBC connections.
• Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop.
• Be sure to understand the performance implications of the different JDBC data source variants, especially when working with a production database.
• Consider using a separate replica for Spark jobs.
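Since your job also writes to databases, the same idea applies on the write side. The following is a sketch under assumed names, not a definitive recipe: coalesce bounds the number of concurrent writer tasks (and therefore connections), and the "batchsize" option sets how many rows go into each JDBC batch insert.

    // Throttled write: coalesce(4) bounds concurrent write connections to 4,
    // and "batchsize" sets the rows per JDBC batch insert.
    df.coalesce(4)
      .write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.events_out") // hypothetical target table
      .option("user", "etl_user")
      .option("password", sys.env("DB_PASSWORD"))
      .option("batchsize", "5000")            // rows per batch insert
      .mode("append")
      .save()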


If you wish to know more about reading data using the JDBC source, I suggest you read the following:

• Spark SQL and Dataset API.

Disclaimer: I'm the co-author of that repo.






answered Nov 16 '18 at 14:05
eliasah































