PostgreSQL: speed up GROUP BY over 30 million rows

Is there any way to increase the speed of a dynamic GROUP BY query? I have a table with 30 million rows.



    create table if not exists tb
    (
        id serial not null constraint tb_pkey primary key,
        week integer,
        month integer,
        year integer,
        starttime varchar(20),
        endtime varchar(20),
        brand smallint,
        category smallint,
        value real
    );


The query below takes 8.5 seconds.



    SELECT category FROM tb GROUP BY category;



Is there any way to increase the speed? I have tried it with and without an index.

postgresql performance rdbms database-performance

asked Nov 10 at 19:06 by Viswanath Lekshmanan


1 Answer

For that exact query, not really; answering it requires scanning every row, and there is no way around that.
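
You can sanity-check this with EXPLAIN; a minimal sketch against the table above (expect either a Seq Scan or a full Index Only Scan feeding the aggregate node, and either way all ~30 million entries are read):

    -- Inspect the plan PostgreSQL chooses for the grouping query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT category FROM tb GROUP BY category;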



But if you just need the set of unique categories quickly, and you have an index on that column, you can use a variation of the WITH RECURSIVE example (a technique often called a loose index scan) shown in the edit to the question linked below; look toward the end of the question:

Counting distinct rows using recursive cte over non-distinct index

You'll need to change it to return the set of unique values instead of counting them, but it's a simple change:



    testdb=# create table tb(id bigserial, category smallint);
    CREATE TABLE
    testdb=# insert into tb(category) select 2 from generate_series(1, 10000);
    INSERT 0 10000
    testdb=# insert into tb(category) select 1 from generate_series(1, 10000);
    INSERT 0 10000
    testdb=# insert into tb(category) select 3 from generate_series(1, 10000);
    INSERT 0 10000
    testdb=# create index on tb(category);
    CREATE INDEX
    testdb=# WITH RECURSIVE cte AS
    (
        (SELECT category
         FROM tb
         WHERE category >= 0
         ORDER BY 1
         LIMIT 1)
        UNION ALL
        SELECT (SELECT category
                FROM tb
                WHERE category > c.category
                ORDER BY 1
                LIMIT 1)
        FROM cte c
        WHERE category IS NOT NULL
    )
    SELECT category
    FROM cte
    WHERE category IS NOT NULL;
     category
    ----------
            1
            2
            3
    (3 rows)


And here's the EXPLAIN ANALYZE:

                                                                QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------------------------
     CTE Scan on cte  (cost=40.66..42.68 rows=100 width=2) (actual time=0.057..0.127 rows=3 loops=1)
       Filter: (category IS NOT NULL)
       Rows Removed by Filter: 1
       CTE cte
         ->  Recursive Union  (cost=0.29..40.66 rows=101 width=2) (actual time=0.052..0.119 rows=4 loops=1)
               ->  Limit  (cost=0.29..0.33 rows=1 width=2) (actual time=0.051..0.051 rows=1 loops=1)
                     ->  Index Only Scan using tb_category_idx on tb tb_1  (cost=0.29..1363.29 rows=30000 width=2) (actual time=0.050..0.050 rows=1 loops=1)
                           Index Cond: (category >= 0)
                           Heap Fetches: 1
               ->  WorkTable Scan on cte c  (cost=0.00..3.83 rows=10 width=2) (actual time=0.015..0.015 rows=1 loops=4)
                     Filter: (category IS NOT NULL)
                     Rows Removed by Filter: 0
                     SubPlan 1
                       ->  Limit  (cost=0.29..0.36 rows=1 width=2) (actual time=0.016..0.016 rows=1 loops=3)
                             ->  Index Only Scan using tb_category_idx on tb  (cost=0.29..755.95 rows=10000 width=2) (actual time=0.015..0.015 rows=1 loops=3)
                                   Index Cond: (category > c.category)
                                   Heap Fetches: 2
     Planning time: 0.224 ms
     Execution time: 0.191 ms
    (19 rows)


The number of loops for the WorkTable Scan node equals the number of unique categories plus one (in the demo above, 3 categories gives loops=4), so this should stay very fast up to, say, hundreds of unique values.
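
If you're unsure how many distinct values the column actually holds, the planner's statistics give a cheap estimate first (a sketch; n_distinct is maintained by ANALYZE and is an estimate, not an exact count):

    -- Estimated number of distinct categories, per planner statistics.
    -- A positive value is an absolute estimate; a negative value is a
    -- fraction of the row count. Requires the table to have been ANALYZEd.
    SELECT n_distinct
    FROM pg_stats
    WHERE tablename = 'tb' AND attname = 'category';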



Another route is to maintain a separate table that stores only the unique values of tb.category, with application logic inserting into it whenever that column is inserted or updated. This can also be done database-side with triggers; that approach is also discussed in the answers to the linked question. A sketch of the trigger variant follows.
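
A minimal sketch of that trigger variant; the names tb_categories and tb_record_category are hypothetical, and ON CONFLICT requires PostgreSQL 9.5+:

    -- Hypothetical lookup table holding every category value seen so far.
    CREATE TABLE IF NOT EXISTS tb_categories (
        category smallint PRIMARY KEY
    );

    -- Trigger function: record the category of each inserted/updated row,
    -- silently skipping values that are already present (or NULL).
    CREATE OR REPLACE FUNCTION tb_record_category() RETURNS trigger AS $$
    BEGIN
        IF NEW.category IS NOT NULL THEN
            INSERT INTO tb_categories (category)
            VALUES (NEW.category)
            ON CONFLICT (category) DO NOTHING;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER tb_categories_sync
        BEFORE INSERT OR UPDATE OF category ON tb
        FOR EACH ROW EXECUTE PROCEDURE tb_record_category();

Reading the distinct set then becomes a trivially fast "SELECT category FROM tb_categories". The trade-offs: a small overhead on every write to tb, and the lookup table does not shrink when the last row of a category is deleted.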






answered Nov 10 at 21:04 by AdamKG, edited Nov 10 at 21:13