Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date:
Msg-id: 55846D3F.3070607@2ndquadrant.com
In response to: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
List: pgsql-hackers
On 06/19/2015 08:32 PM, Jeff Janes wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> Hi,
>>
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>>
>> The problem is Q18, particularly this simple subquery:
>>
>>     select l_orderkey
>>     from lineitem
>>     group by l_orderkey
>>     having sum(l_quantity) > 313;
>>
>> which is planned like this:
>>
>>                                  QUERY PLAN
>> ---------------------------------------------------------------------------------
>>  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>>    Group Key: l_orderkey
>>    Filter: (sum(l_quantity) > '313'::double precision)
>>    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
>> (4 rows)
>>
>> but sadly, in reality the l_orderkey cardinality looks like this:
>>
>>     tpch=# select count(distinct l_orderkey) from lineitem;
>>        count
>>     ------------
>>      4500000000
>>     (1 row)
>>
>> That's a helluva difference - not the usual one or two orders of
>> magnitude, but a 10000x underestimate.
>
> Is the row order in the table correlated with the value of l_orderkey?

The statistics look like this:

 attnum |  attname   | n_distinct  | correlation
--------+------------+-------------+-------------
      1 | l_orderkey | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.

> Could you create a copy of the table ordered at random, and see if it
> exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so that's not really easy to do. But I
don't really see why that should matter, because the sample is supposed
to be random.

I think you might be on to something, though, because I manually
collected a random sample of 30k rows (by explicitly generating 30k
random TIDs), and I get this:

tpch=# select cnt, count(*)
         from (select l_orderkey, count(*) AS cnt
                 from lineitem_sample group by 1) foo
        group by 1;

 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)

That's quite different from what analyze gets, which effectively looks
something like this (derived from the logs, so not perfectly accurate -
I only have f1, ndistinct and nmultiple):

 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

Am I wrong, or is the sample not that random?

--
Tomas Vondra                           http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
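The message doesn't show how the 30k random TIDs were generated. As an
illustration only - a hypothetical sketch, not Tomas's actual code - one
way to build such a sample is to pick uniformly random (block, offset)
pairs and fetch whatever live tuple each one points at. The
60-tuples-per-page cap below is an assumption; TIDs that miss a live
tuple simply fetch nothing, which also keeps the accepted tuples uniform
over the live ones:

    create temp table lineitem_sample (l_orderkey bigint);

    do $$
    declare
        nblocks bigint := pg_relation_size('lineitem')
                          / current_setting('block_size')::bigint;
        t       tid;
    begin
        for i in 1..30000 loop
            -- random block and line pointer; 60 is an assumed upper
            -- bound on tuples per lineitem page
            t := format('(%s,%s)', (random() * (nblocks - 1))::bigint,
                                   1 + (random() * 59)::int)::tid;

            -- "ctid = <plain variable>" can use a TID scan, so each
            -- probe touches a single heap page
            insert into lineitem_sample
                select l_orderkey from lineitem where ctid = t;
        end loop;
    end $$;

Because misses fetch nothing, this can come up slightly short of 30000
rows; looping until 30k rows have been collected would fix that.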
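As background for the f1/ndistinct/nmultiple numbers: ANALYZE computes
n_distinct with the Haas-Stokes Duj1 estimator (see the comment in
src/backend/commands/analyze.c), n*d / (n - f1 + f1*n/N), where n is the
sample size, d the number of distinct values in the sample, f1 the number
of values seen exactly once, and N the estimated table row count. Plugging
in the two distributions above - a sketch assuming n = 30000 and the
planner's N = 18000048128; the pg_stats value quoted earlier was
presumably taken with a different statistics target, so these numbers
only illustrate the mechanism - shows how much the handful of extra
duplicates matters:

    -- ANALYZE's sample: d = 27976 + 976 + 24 = 28976, f1 = 27976
    select 30000.0 * 28976 / (30000 - 27976 + 27976 * 30000.0 / 18000048128);
    -- ~429000: about 10000x below the true 4.5e9

    -- the hand-built random sample: d = 29999, f1 = 29998
    select 30000.0 * 29999 / (30000 - 29998 + 29998 * 30000.0 / 18000048128);
    -- ~4.4e8: only ~10x below the true count

With almost everything seen exactly once, the denominator is dominated by
(n - f1), so roughly 1000 duplicated values in the sample (instead of 1)
shrink the estimate by about three orders of magnitude.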