BUG #16031: Group by returns duplicate groups - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #16031: Group by returns duplicate groups
Date
Msg-id 16031-4ef55395a5fbb687@postgresql.org
Whole thread Raw
Responses Re: BUG #16031: Group by returns duplicate groups
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16031
Logged by:          David Raymond
Email address:      david.raymond@tomtom.com
PostgreSQL version: 11.5
Operating system:   Windows 10
Description:

I have a large table which I'm running a query on, grouping on a single
field, and returning only that 1 field, filtered with a HAVING clause. The
results that I'm getting back however contain a duplicate entry for one of
the returned values, which I believe should never be able to happen.
I'm working on dropping as many extra fields and records as I can to still
get the bad result from something small enough to send as a test case, but
figured I'd post this while I'm at it to get any advice. I've done a dump of
the table and then restored it to a new table, and still get the same weird
results. On the original version of the table I had clustered it on the
index that starts with the field being grouped on, but it's still giving the
bad groups after a dump and restore.
I'm running 11.5 on Windows 10 (Enterprise DB installer) and unfortunately
don't have the resources to build a new version myself if you come up with a
patch (also why I'm hoping to shrink it down to where others can test it)

Here's output from psql:

testing=> select version();
                          version
------------------------------------------------------------
 PostgreSQL 11.5, compiled by Visual C++ build 1914, 64-bit
(1 row)

Time: 0.272 ms

testing=> \d+ big_table
                                         Table "name_stuff.big_table"
   Column    |          Type          | Collation | Nullable | Default |
Storage  | Stats target | Description
-------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id_1        | uuid                   |           | not null |         |
plain    |              |
 field_2     | uuid                   |           |          |         |
plain    |              |
 name        | character varying(254) |           | not null |         |
extended |              |
 field_4     | character varying(254) |           |          |         |
extended |              |
 field_5     | numeric(2,0)           |           | not null |         |
main     |              |
 field_6     | character varying(4)   |           |          |         |
extended |              |
 field_7     | character varying(3)   |           | not null |         |
extended |              |
 field_8     | character varying(3)   |           |          |         |
extended |              |
 arr_field_1 | character varying(254) |           |          |         |
extended |              |
 arr_field_2 | character varying(254) |           |          |         |
extended |              |
 arr_field_3 | character varying(254) |           |          |         |
extended |              |
 arr_field_4 | character varying(254) |           |          |         |
extended |              |
 arr_field_5 | character varying(254) |           |          |         |
extended |              |
 arr_field_6 | character varying(254) |           |          |         |
extended |              |
 field_15    | boolean                |           |          |         |
plain    |              |
 field_16    | boolean                |           |          |         |
plain    |              |
 id_2        | text                   |           | not null |         |
extended |              |
Indexes:
    "big_table_pkey" PRIMARY KEY, btree (id_1, id_2)
    "big_table_name_id_1_id_2_idx" btree (name, id_1, id_2)

testing=> select count(*) from big_table;
    count
-------------
 108,565,086
(1 row)

Time: 273770.205 ms (04:33.770)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) create
table bad_groups_1 as select name from big_table group by name having
count(distinct array[arr_field_1, arr_field_2, arr_field_3, arr_field_4,
arr_field_5, arr_field_6]) > 1;

      QUERY PLAN                                       

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=29317810.81..30149735.19 rows=487417 width=22)
(actual time=758501.326..895894.082 rows=745024 loops=1)
   Output: name
   Group Key: big_table.name
   Filter: (count(DISTINCT ARRAY[big_table.arr_field_1,
big_table.arr_field_2, big_table.arr_field_3, big_table.arr_field_4,
big_table.arr_field_5, big_table.arr_field_6]) > 1)
   Rows Removed by Filter: 80610652
   Buffers: shared hit=2325 read=1515081, temp read=2464481
written=2467410
   ->  Sort  (cost=29317810.81..29589026.23 rows=108486168 width=57) (actual
time=758493.476..819035.136 rows=108565086 loops=1)
         Output: name, arr_field_1, arr_field_2, arr_field_3, arr_field_4,
arr_field_5, arr_field_6
         Sort Key: big_table.name
         Sort Method: external merge  Disk: 4174488kB
         Buffers: shared hit=2317 read=1515081, temp read=2464481
written=2467410
         ->  Seq Scan on name_stuff.big_table  (cost=0.00..2602259.68
rows=108486168 width=57) (actual time=23.216..119113.708 rows=108565086
loops=1)
               Output: name, arr_field_1, arr_field_2, arr_field_3,
arr_field_4, arr_field_5, arr_field_6
               Buffers: shared hit=2317 read=1515081
 Planning Time: 0.196 ms
 Execution Time: 897276.109 ms
(16 rows)

Time: 897285.808 ms (14:57.286)

testing=> \d+ bad_groups_1
                                     Table "name_stuff.bad_groups_1"
 Column |          Type          | Collation | Nullable | Default | Storage
| Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 name   | character varying(254) |           |          |         | extended
|              |

testing=> select count(*), count(distinct name) from bad_groups_1;
  count  |  count
---------+---------
 745,024 | 745,023
(1 row)

Time: 899.273 ms

testing=> select name from bad_groups_1 group by name having count(*) > 1;
 name
------
 DK
(1 row)

Time: 337.663 ms

testing=>


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #16030: yum update failed due to download the https://download.postgresql.org/pub/repos/yum/12/redhat/rhel-7
Next
From: Tom Lane
Date:
Subject: Re: BUG #16031: Group by returns duplicate groups