Re: BUG #16031: Group by returns duplicate groups
From:           Tomas Vondra
Subject:        Re: BUG #16031: Group by returns duplicate groups
Date:           2019-10-07
Msg-id:         20191007205341.koxjl5etfoqg4y7l@development
In response to: RE: BUG #16031: Group by returns duplicate groups (David Raymond <David.Raymond@tomtom.com>)
Responses:      RE: BUG #16031: Group by returns duplicate groups
List:           pgsql-bugs
Hi,

On Mon, Oct 07, 2019 at 06:29:04PM +0000, David Raymond wrote:
>Downloaded and installed 12.0, created a nice shiny new cluster, and
>confirmed that it's still doing it. Now in 12 you have to force it to
>materialize the CTE, which was why I had used a CTE in 11 in the first
>place.
>

I've tried running this on a 12.0 cluster too (on Linux though), using
the data set linked by Andrew, and I haven't observed any duplicate rows
either. I do have some observations, though ...

>
>testing=> select version();
>                          version
>------------------------------------------------------------
> PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
>(1 row)
>
>Time: 0.148 ms
>testing=> \d+ weird_grouping
>                                  Table "name_stuff.weird_grouping"
> Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
>--------+------------------------+-----------+----------+---------+----------+--------------+-------------
> name   | character varying(254) |           | not null |         | extended |              |
>Access method: heap
>
>testing=> analyze verbose weird_grouping;
>INFO:  analyzing "name_stuff.weird_grouping"
>INFO:  "weird_grouping": scanned 8316 of 8316 pages, containing 1297265 live rows and 0 dead rows; 30000 rows in sample, 1297265 estimated total rows
>ANALYZE
>Time: 206.577 ms
>testing=> select count(*), count(distinct name) from weird_grouping;
>   count   |   count
>-----------+-----------
> 1,297,265 | 1,176,103
>(1 row)
>

I do get this:

  count  |  count
---------+---------
 1297265 | 1176101
(1 row)

>Time: 6729.011 ms (00:06.729)
>testing=> with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
> name
>------
>(0 rows)
>
>Time: 7289.128 ms (00:07.289)
>testing=> explain (analyze, verbose, costs, settings, buffers, timing, summary) with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
>                                                                 QUERY PLAN
>--------------------------------------------------------------------------------------------------------------------------------------------------------
> GroupAggregate  (cost=179613.72..200268.04 rows=67 width=20) (actual time=6203.868..6203.868 rows=0 loops=1)
>   Output: weird_grouping.name
>   Group Key: weird_grouping.name
>   Filter: (count(*) > 1)
>   Rows Removed by Filter: 1176103
>   Buffers: shared hit=2464 read=5852, temp read=7800 written=7830
>   ->  Group  (cost=179613.72..186100.05 rows=944366 width=20) (actual time=4769.781..5985.111 rows=1176103 loops=1)
>         Output: weird_grouping.name
>         Group Key: weird_grouping.name
>         Buffers: shared hit=2464 read=5852, temp read=7800 written=7830
>         ->  Sort  (cost=179613.72..182856.89 rows=1297265 width=20) (actual time=4769.779..5844.350 rows=1297265 loops=1)
>               Output: weird_grouping.name
>               Sort Key: weird_grouping.name
>               Sort Method: external merge  Disk: 39048kB
>               Buffers: shared hit=2464 read=5852, temp read=7800 written=7830
>               ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..21288.65 rows=1297265 width=20) (actual time=0.059..102.772 rows=1297265 loops=1)
>                     Output: weird_grouping.name
>                     Buffers: shared hit=2464 read=5852
> Settings: search_path = 'name_stuff'
> Planning Time: 0.048 ms
> Execution Time: 7115.761 ms
>(21 rows)
>

I'd like to point out that this may not actually be correct either. It
does not produce any rows, i.e. it does not see any "duplicate groups",
but it does this:

  ->  Group  (cost=179613.72..186100.05 rows=944366 width=20) (actual time=4769.781..5985.111 rows=1176103 loops=1)

Notice that it produces 1176103 groups, but is that actually correct?
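One collation-independent cross-check (a sketch, not part of the
original mail) is to count distinct byte images of the values, which
bypasses the collation-aware varchar comparison that the Sort/Group
nodes rely on:

    -- sketch: count groups by raw bytes rather than by collation-aware
    -- text equality; convert_to() returns bytea, compared bytewise, so
    -- a mismatch with count(distinct name) points at the comparator
    select count(distinct convert_to(name, 'UTF8'))
      from weird_grouping;
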
On my machine I only see 1176101 groups, and the difference (2) would
match the DCT and DELTA duplicates with the next query.

>Time: 7116.170 ms (00:07.116)
>testing=> with foo as materialized (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
> name
>-------
> DCT
> DELTA
>(2 rows)
>
>Time: 8850.833 ms (00:08.851)
>testing=> explain (analyze, verbose, costs, settings, buffers, timing, summary) with foo as materialized (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
>                                                                  QUERY PLAN
>----------------------------------------------------------------------------------------------------------------------------------------------------------
> HashAggregate  (cost=209709.20..209711.70 rows=67 width=516) (actual time=6676.811..6768.094 rows=2 loops=1)
>   Output: foo.name
>   Group Key: foo.name
>   Filter: (count(*) > 1)
>   Rows Removed by Filter: 1176099
>   Buffers: shared hit=2528 read=5788, temp read=7800 written=12363
>   CTE foo
>     ->  Group  (cost=179613.72..186100.05 rows=944366 width=20) (actual time=4774.681..6004.725 rows=1176103 loops=1)
>           Output: weird_grouping.name
>           Group Key: weird_grouping.name
>           Buffers: shared hit=2528 read=5788, temp read=7800 written=7830
>           ->  Sort  (cost=179613.72..182856.89 rows=1297265 width=20) (actual time=4774.678..5860.270 rows=1297265 loops=1)
>                 Output: weird_grouping.name
>                 Sort Key: weird_grouping.name
>                 Sort Method: external merge  Disk: 39048kB
>                 Buffers: shared hit=2528 read=5788, temp read=7800 written=7830
>                 ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..21288.65 rows=1297265 width=20) (actual time=0.065..101.141 rows=1297265 loops=1)
>                       Output: weird_grouping.name
>                       Buffers: shared hit=2528 read=5788
>   ->  CTE Scan on foo  (cost=0.00..18887.32 rows=944366 width=516) (actual time=4774.683..6228.002 rows=1176103 loops=1)
>         Output: foo.name
>         Buffers: shared hit=2528 read=5788, temp read=7800 written=12363
> Settings: search_path = 'name_stuff'
> Planning Time: 0.054 ms
> Execution Time: 8786.597 ms
>(25 rows)
>
>Time: 8787.011 ms (00:08.787)
>testing=>
>

Can you try running this with enable_hashagg = off? That should give
you another Sort on the CTE Scan, and a GroupAggregate at the top. I
wonder if that makes the issue go away ...
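Concretely, the suggested re-run would look like this (a sketch, not
part of the original mail; enable_hashagg is a planner GUC and the SET
only affects the current session):

    -- sketch: disable hash aggregation so the outer aggregate has to
    -- sort the CTE output and use a GroupAggregate instead
    set enable_hashagg = off;
    explain (analyze, verbose, buffers)
    with foo as materialized (select name from weird_grouping group by name)
    select name from foo group by name having count(*) > 1;
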
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services