Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Msg-id 55846D3F.3070607@2ndquadrant.com
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers

On 06/19/2015 08:32 PM, Jeff Janes wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     Hi,
>
>     I'm currently running some tests on a 3TB TPC-H data set, and I
>     tripped over a pretty bad n_distinct underestimate, causing OOM in
>     HashAgg (which somehow illustrates the importance of the
>     memory-bounded hashagg patch Jeff Davis is working on).
>
>     The problem is Q18, particularly this simple subquery:
>
>          select l_orderkey
>          from lineitem
>          group by l_orderkey
>          having sum(l_quantity) > 313;
>
>     which is planned like this:
>
>                                     QUERY PLAN
>     ---------------------------------------------------------------------------------
>       HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>         Group Key: l_orderkey
>         Filter: (sum(l_quantity) > '313'::double precision)
>         ->  Seq Scan on lineitem  (cost=0.00..508509923.28
>     rows=18000048128 width=12)
>     (4 rows)
>
>     but sadly, in reality the l_orderkey cardinality looks like this:
>
>          tpch=# select count(distinct l_orderkey) from lineitem;
>             count
>          ------------
>           4500000000
>          (1 row)
>
>     That's a helluva difference - not the usual one or two orders of
>     magnitude, but 10000x underestimate.
>
>
>
> Is the row order in the table correlated with the value l_orderkey?

The statistics look like this:

 attnum |   attname   | n_distinct  | correlation
--------+-------------+-------------+-------------
      1 | l_orderkey  | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.

> Could you create copy of the table ordered at random, and see if it
> exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so it's not really easy to do. But I
don't really see why that should matter, because the sample is supposed
to be random.

But I think you might be on to something, because I manually collected a 
random sample with 30k rows (by explicitly generating 30k random TIDs), 
and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt 
from lineitem_sample group by 1) foo group by 1;
 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)


That's quite different from what analyze gets, which effectively looks
something like this (derived from the logs, so not perfectly accurate -
I only have f1, ndistinct and nmultiple):

 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

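For context, ANALYZE feeds per-frequency counts like these into the
Haas-Stokes "Duj1" estimator used in analyze.c, n*d / (n - f1 + f1*n/N).
A quick sketch (N taken from the plan's row estimate; the counts are the
approximate ones above, so treat the numbers as ballpark only):

```python
def duj1(n, N, d, f1):
    # Haas & Stokes "Duj1" n_distinct estimator, as used in analyze.c:
    #   n  = sample size, N = total rows,
    #   d  = distinct values in the sample, f1 = values seen exactly once
    return n * d / (n - f1 + f1 * n / N)

N = 18_000_048_128                 # lineitem row estimate from the plan

# frequency counts ANALYZE effectively saw (table above)
n = 27976 + 2 * 976 + 3 * 24       # 30000 sampled rows
d = 27976 + 976 + 24               # 28976 distinct values
print(duj1(n, N, d, f1=27976))     # ~430k, same ballpark as the plan's 418401

# frequency counts from the manual random-TID sample
print(duj1(30000, N, d=29999, f1=29998))  # ~440M, within ~10x of the true 4.5e9
```

So the extra duplicates in ANALYZE's sample are enough, on their own, to
pull the estimate down by three orders of magnitude.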
Am I wrong or is the sample not that random?
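One way to check that intuition is a toy simulation of two-stage
(block-based) sampling over a perfectly correlated layout. All the
parameters below are made-up approximations, and the sampler is only a
rough stand-in for what acquire_sample_rows() actually does:

```python
import random
from collections import Counter

random.seed(42)

ROWS_PER_BLOCK = 60            # assumed heap rows per 8kB page
ROWS_PER_KEY   = 7             # ~7 lineitem rows per order (TPC-H average)
N_BLOCKS       = 300_000_000   # ~2.5TB / 8kB
SAMPLE         = 30_000

def key(block, off):
    # perfectly correlated layout: consecutive rows share an l_orderkey
    return (block * ROWS_PER_BLOCK + off) // ROWS_PER_KEY

# Two-stage sample: pick SAMPLE random blocks, then SAMPLE random rows
# out of just those blocks.
blocks = [random.randrange(N_BLOCKS) for _ in range(SAMPLE)]
picks = random.sample(range(SAMPLE * ROWS_PER_BLOCK), SAMPLE)
two_stage = [key(blocks[p // ROWS_PER_BLOCK], p % ROWS_PER_BLOCK) for p in picks]

# Truly row-random sample over the whole table.
row_random = [key(p // ROWS_PER_BLOCK, p % ROWS_PER_BLOCK)
              for p in random.sample(range(N_BLOCKS * ROWS_PER_BLOCK), SAMPLE)]

def dup_profile(sample):
    # {times a key appears in the sample: how many keys appear that often}
    return dict(sorted(Counter(Counter(sample).values()).items()))

print("two-stage: ", dup_profile(two_stage))   # hundreds of repeated keys
print("row-random:", dup_profile(row_random))  # almost all keys seen once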

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


