Thread: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
Hi,

I'm currently running some tests on a 3TB TPC-H data set, and I tripped 
over a pretty bad n_distinct underestimate, causing OOM in HashAgg 
(which somehow illustrates the importance of the memory-bounded hashagg 
patch Jeff Davis is working on).

The problem is Q18, particularly this simple subquery:
    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;

which is planned like this:
                               QUERY PLAN
---------------------------------------------------------------------------------
 HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
   Group Key: l_orderkey
   Filter: (sum(l_quantity) > '313'::double precision)
   ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
(4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:
    tpch=# select count(distinct l_orderkey) from lineitem;
       count
    ------------
     4500000000
    (1 row)

That's a helluva difference - not the usual one or two orders of 
magnitude, but 10000x underestimate.

The usual thing to do in this case is increasing statistics target, and 
while this improves the estimate, the improvement is rather small:
   statistics target     estimate        difference
   --------------------------------------------------
   100                     429491             10000
   1000                   4240418              1000
   10000                 42913759               100

I find the pattern rather strange - every time the statistics target 
increases 10x, the difference decreases 10x - maybe that's natural, but 
the perfect proportionality is suspicious IMHO.

Also, this is a quite large dataset - the table has ~18 billion rows, 
and even with target=10000 we're sampling only 3M rows, which is ~0.02%. 
That's a tiny sample, so inaccuracy is naturally expected, but OTOH the 
TPC-H dataset is damn uniform - there's pretty much no skew in the 
distributions AFAIK. So I'd expect a slightly better result.

With target=10000 the plan switches to GroupAggregate, because the 
estimate gets high enough for the hash table to exceed work_mem (2GB). 
But it's still way off, and it's mostly just a lucky coincidence.

So I'm wondering if there's some bug because of the dataset size (an 
integer overflow or something like that), so I added a bunch of logging into 
the estimator, logging all the parameters computed:

target=100 (samplerows=30000)
-----------------------------
WARNING:  attnum=1 attname=l_orderkey f1=27976 ndistinct=28977 
nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.000000 
denom=2024.046627 stadistinct=429491.094029
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=429491.09 adaptive=443730.00

target=1000 (samplerows=300000)
-------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=279513 ndistinct=289644 
nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.000000 
denom=20491.658538 stadistinct=4240418.111618
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=4240418.11 adaptive=4375171.00

target=10000 (samplerows=3000000)
---------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799 
nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.000000 
denom=202578.313396 stadistinct=42913759.396282
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=42913759.40 adaptive=44449882.00

It's totalrows=18000049031 in all cases. The logs also show the estimate 
produced by the adaptive estimator (discussed in a separate thread), but 
apparently that does not change the estimates much :-(

Any ideas?

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jeff Janes
Date:
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow illustrates the importance of the memory-bounded hashagg patch Jeff Davis is working on).

The problem is Q18, particularly this simple subquery:

    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;

which is planned like this:

                               QUERY PLAN
---------------------------------------------------------------------------------
 HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
   Group Key: l_orderkey
   Filter: (sum(l_quantity) > '313'::double precision)
   ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
(4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:

    tpch=# select count(distinct l_orderkey) from lineitem;
       count
    ------------
     4500000000
    (1 row)

That's a helluva difference - not the usual one or two orders of magnitude, but 10000x underestimate.


Is the row order in the table correlated with the value l_orderkey?

Could you create copy of the table ordered at random, and see if it exhibits the same estimation issue?

Cheers,

Jeff

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:

On 06/19/2015 08:32 PM, Jeff Janes wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     Hi,
>
>     I'm currently running some tests on a 3TB TPC-H data set, and I
>     tripped over a pretty bad n_distinct underestimate, causing OOM in
>     HashAgg (which somehow illustrates the importance of the
>     memory-bounded hashagg patch Jeff Davis is working on).
>
>     The problem is Q18, particularly this simple subquery:
>
>          select l_orderkey
>          from lineitem
>          group by l_orderkey
>          having sum(l_quantity) > 313;
>
>     which is planned like this:
>
>                                     QUERY PLAN
>     ---------------------------------------------------------------------------------
>       HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>         Group Key: l_orderkey
>         Filter: (sum(l_quantity) > '313'::double precision)
>         ->  Seq Scan on lineitem  (cost=0.00..508509923.28
>     rows=18000048128 width=12)
>     (4 rows)
>
>     but sadly, in reality the l_orderkey cardinality looks like this:
>
>          tpch=# select count(distinct l_orderkey) from lineitem;
>             count
>          ------------
>           4500000000
>          (1 row)
>
>     That's a helluva difference - not the usual one or two orders of
>     magnitude, but 10000x underestimate.
>
>
>
> Is the row order in the table correlated with the value l_orderkey?

The statistics look like this:
 attnum |     attname     | n_distinct  | correlation
--------+-----------------+-------------+-------------
      1 | l_orderkey      | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.

> Could you create copy of the table ordered at random, and see if it
> exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so that's not really easy to do. But I 
don't really see why that should matter, because the sample is supposed 
to be random.

But I think you might be on to something, because I manually collected a 
random sample with 30k rows (by explicitly generating 30k random TIDs), 
and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt 
from lineitem_sample group by 1) foo group by 1;
 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)


That's quite different compared to what analyze gets, which effectively 
looks something like this (this is derived from the logs, so not 
perfectly accurate - I only have f1, ndistinct, nmultiple):
 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

Am I wrong or is the sample not that random?

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jeff Janes
Date:
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from lineitem_sample group by 1) foo group by 1;

 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)


That's quite different compared to what analyze gets, which effectively looks something like this (this is derived from the logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):

 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

Am I wrong or is the sample not that random?

The sample is not truly random.  The two-stage sampling method causes too few blocks to have exactly one row chosen from them, and too many to have either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, then it finds too many duplicates because it too often picks two rows from a single block.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

If we assume all the blocks have the same tuple density, then it is easy to correct this.  But without that assumption of constant tuple density, I don't know how to get a truly random sample while still skipping most of the table.
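Just to illustrate the effect with a toy model (assuming, for the sake of argument, that every block has the same tuple density): if the 30000 sampled rows land roughly uniformly on the 30000 pre-selected blocks, the per-block sample counts come out roughly Poisson(1) rather than almost always 0 or 1, so a sizable fraction of blocks contributes 2+ rows:

    -- toy model of the second stage: 30000 sample rows spread uniformly
    -- over the 30000 chosen blocks (block ids 1..30000 stand in for the
    -- real block numbers)
    select cnt, count(*) as blocks
    from (select (1 + floor(random() * 30000))::int as blk, count(*) as cnt
          from generate_series(1, 30000)
          group by 1) s
    group by cnt
    order by cnt;

When neighboring rows in a block tend to share the same value, every block that contributes 2+ rows turns into duplicate values in the sample, which is the excess visible in the analyze numbers above.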

Cheers,

Jeff

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
On 06/19/2015 09:48 PM, Jeff Janes wrote:
> On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     But I think you might be on to something, because I manually
>     collected a random sample with 30k rows (by explicitly generating
>     30k random TIDs), and I get this:
>
>     tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
>     from lineitem_sample group by 1) foo group by 1;
>
>       cnt | count
>     -----+-------
>         1 | 29998
>         2 |     1
>     (2 rows)
>
>
>     That's quite different compared to what analyze gets, which
>     effectively looks something like this (this is derived from the
>     logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):
>
>       cnt | count
>     -----+-------
>         1 | 27976
>         2 |   976
>         3 |    24
>
>     Am I wrong or is the sample not that random?
>
>
> The sample is not truly random.  The two-stage sampling method causes
> too few blocks to have exactly one row chosen from them, and too many to
> have either 0 or 2+ rows chosen from them.
>
> When values in the same block are likely to be equal, then it finds too
> many duplicates because it too often picks two rows from a single block.

Yeah, I came to the same conclusion after a bit of experimenting. I've 
logged the block numbers for all the 30k sampled tuples (target=100) and 
I get these statistics for the number of repetitions:
 cnt | count
-----+-------
   1 | 11020
   2 |  5637
   3 |  1800
   4 |   450
   5 |    94
   6 |     6

so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 
2 tuples sampled etc.

With truly random sampling (just generating 30k random numbers between 0 
and 328509442, the number of pages of this particular table), I get this:

test=# select cnt, count(*) from (select (328509442 * random())::int AS 
blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;
 cnt | count
-----+-------
   1 | 29994
   2 |     3

So yeah, not really random.

> See analysis here:
>
> http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

Thanks.

> If we assume all the blocks have the same tuple density, then it is
> easy to correct this. But without that assumption of constant tuple
> density, I don't know how to get a truly random sample while still
> skipping most of the table.

Hmmm, that's probably true. OTOH correlated columns are not all that 
uncommon (e.g. table storing time-series data etc.), and this blowup is 
quite bad ...

I don't think we need to really assume the density to be constant, maybe 
we can verify that while collecting the sample? I mean, we're already 
reading the pages, so we can track the density, and either do the 
correction or not.

Also, doesn't Vitter make pretty much the same assumption implicitly - 
otherwise it couldn't skip some of the blocks?

regards
Tomas

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jeff Janes
Date:
On Fri, Jun 19, 2015 at 1:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 06/19/2015 09:48 PM, Jeff Janes wrote:
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:

    But I think you might be on to something, because I manually
    collected a random sample with 30k rows (by explicitly generating
    30k random TIDs), and I get this:

    tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
    from lineitem_sample group by 1) foo group by 1;

      cnt | count
    -----+-------
        1 | 29998
        2 |     1
    (2 rows)


    That's quite different compared to what analyze gets, which
    effectively looks something like this (this is derived from the
    logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):

      cnt | count
    -----+-------
        1 | 27976
        2 |   976
        3 |    24

    Am I wrong or is the sample not that random?


The sample is not truly random.  The two-stage sampling method causes
too few blocks to have exactly one row chosen from them, and too many to
have either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, then it finds too
many duplicates because it too often picks two rows from a single block.

Yeah, I came to the same conclusion after a bit of experimenting. I've logged the block numbers for all the 30k sampled tuples (target=100) and I get this statistics for number of repetitions:

 cnt | count
-----+-------
   1 | 11020
   2 |  5637
   3 |  1800
   4 |   450
   5 |    94
   6 |     6

so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 2 tuples sampled etc.

With truly random sampling (just generating 30k random numbers between 0 and 328509442 (number of pages of this particular table), I get this:

test=# select cnt, count(*) from (select (328509442 * random())::int AS blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;

 cnt | count
-----+-------
   1 | 29994
   2 |     3

So yeah, not really random.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

Thanks.

If we assume all the blocks have the same tuple density, then it is
easy to correct this. But without that assumption of constant tuple
density, I don't know how to get a truly random sample while still
skipping most of the table.

Hmmm, that's probably true. OTOH correlated columns are not all that uncommon (e.g. table storing time-series data etc.), and this blowup is quite bad ...

True, but we don't know how big of a problem the density-skew problem might be (since the current algorithm isn't sensitive to it).  It might be just as big of a problem.   Tom mentioned some ancient history in the above mentioned thread that made me think the density skew was enough of a problem to motivate the current system.
 

I don't think we need to really assume the density to be constant, maybe we can verify that while collecting the sample? I mean, we're already reading the pages, so we can track the density, and either do the correction or not.

Maybe.  I don't know how that would work.  We would have to keep two samples, and dynamically decide which to use. And what if the decision is that both density skew is a problem and that value clustering is also a problem?
 
I wonder if the n_distinct could be tweaked so that it counted any given value only once for each block it finds it in?  So instead of asking "how many times was this value sampled", ask "in how many blocks was this value sampled at least once"?
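Something like this, assuming the sample rows also carried their source block number (a hypothetical lineitem_sample(l_orderkey, ..., blockno) table, just to make the idea concrete):

    -- per-row frequencies (what the estimator effectively sees today)
    select cnt, count(*)
    from (select l_orderkey, count(*) as cnt
          from lineitem_sample group by 1) f
    group by 1 order by 1;

    -- "in how many blocks was the value sampled at least once"
    select cnt, count(*)
    from (select l_orderkey, count(distinct blockno) as cnt
          from lineitem_sample group by 1) f
    group by 1 order by 1;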


Also, doesn't Vitter make pretty much the same assumption implicitly, otherwise it couldn't skip some of the blocks?

Vitter samples an unstructured stream in a single pass, and is unbiased.  The explicit block sampling is not part of Vitter, it is something we bolted on top of it.

My solution was to just unbolt the block sampling from the top, and let it sample the rows (still 300 * stats_target of them) from the whole table rather than a random 300 * 100 blocks of the table.  On tables of the size I was worried about, block sampling was not very helpful anyway.  Reading 30,000 blocks out of 250,000 blocks lead to no meaningful IO advantage on my hardware. Any advantage from skipped blocks was made up for (and sometimes exceeded) by fouling up the read-ahead mechanisms.  

With 1000 times more blocks, that probably won't work for you.
 
But I do wonder, how much time does it take to read a random 1/10, 1/100, 1/1000, 1/10000 of a table of your size, just from an IO perspective?  How much are we gaining by doing the block sample?

Cheers,

Jeff

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Feng Tian
Date:
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped
> over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
> somehow illustrates the importance of the memory-bounded hashagg patch
> Jeff Davis is working on).
> [...]
> Any ideas?

While better sample/stats is important for choosing a good plan, in this 
query, hash agg is really the right plan. If a sort agg is chosen, the 
performance will be really really bad. The patch that Jeff is working on 
is critical for a decent TPCH number (unless you have unlimited amount of 
memory).

Thanks,

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Robert Haas
Date:
On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped over
> a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow
> illustrates the importance of the memory-bounded hashagg patch Jeff Davis is
> working on).

Stupid question, but why not just override it using ALTER TABLE ...
ALTER COLUMN ... SET (n_distinct = ...)?

I think it's been discussed quite often on previous threads that you
need to sample an awful lot of the table to get a good estimate for
n_distinct.  We could support that, but it would be expensive, and it
would have to be done again every time the table is auto-analyzed.
The above syntax supports nailing the estimate to either an exact
value or a percentage of the table, and I'm not sure why that isn't
good enough.
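For instance, something along these lines (the 4.5 billion figure is from your mail; the negative form means a fraction of reltuples rather than an absolute count):

    -- pin the estimate to an absolute number of distinct values ...
    ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = 4500000000);
    -- ... or to a fraction of the rows (~4.5 billion orders / ~18 billion rows)
    ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = -0.25);
    ANALYZE lineitem;   -- the override takes effect at the next ANALYZE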

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
Hi,

On 06/20/2015 08:54 AM, Feng Tian wrote:
>
> While better sample/stats is important for choosing a good plan, in
> this query, hash agg is really the right plan. If a sort agg is
> chosen, the performance will be really really bad. The patch that
> Jeff is working on is critical for a decent TPCH number (unless you
> have unlimited amount of memory).

I do agree that Jeff's memory-bounded hashagg patch is a very important 
feature, and in fact we spent a fair amount of time discussing it in 
Ottawa. So I'm looking forward to getting that soon ;-)

But I don't think hashagg is going to be very good in this particular 
case. With a 3TB dataset, the query runs out of memory on a machine with 
256GB of RAM. So let's assume a complete hash table has ~256GB. With 
work_mem=1GB that means only ~1/256 of the table can be processed in one 
batch, so we'll process the first 1/256 of the table, and write out the 
remaining 99% into batches. Then we'll read the batches one by one, and 
process those. The table has ~2.5TB, so we'll read 2.5TB, write out 
~2.49TB into batches, and then read those ~2.49TB again. At least that's 
how I understand Jeff's memory-bounded hashagg proposal.

The sort may perform worse in the general case, but in this case there's 
an index on the column, and the table is almost perfectly correlated by 
that column (due to generating the orders one by one, but it seems 
plausible it'd be the same in reality, assuming the orders are numbered 
using a sequence). So doing the sort by an indexscan seems rather cheap, 
and you only need to scan the table once.
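
For a quick manual comparison, something like this should be enough to 
force the GroupAggregate path in a session:

    set enable_hashagg = off;   -- planner falls back to sorted grouping
    explain
    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;
    reset enable_hashagg;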

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:

On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to do that at the 
moment. But I wasn't really sure why we're producing such a poor estimate 
in the first place - even considering how small the sample is, it seemed 
a bit too bad, and the proportionality to the statistics target seemed 
really strange.

Also, if we could produce better n_distinct estimate, wouldn't that be 
awesome? I'd much rather not force users to use the manual override if 
possible.

>
> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases, but 
in this case the main problem is that the sample is not as random as 
needed. It produces way more duplicate values than it should, which then 
utterly confuses the estimator, which assumes a random sample. This 
happens because the column is almost perfectly correlated with the 
physical order of the rows in the table.

I've lobotomized the sampling a bit to really produce a random set of 
blocks first, and that produces way better estimates:
   statistics target     estimate               random
   -----------------------------------------------------------------
   100                     429491 (10000x)   334430766 (14x)
   1000                   4240418  (1000x)   439010499 (10x)

Also, the number of sampled blocks is not that different. With target 
100, the current sampler reads ~2900 blocks, while a completely random 
sampler uses 3000 blocks. So, where's the benefit?

I'm sure there are counter-examples, but that's true for all estimators. 
I do realize the random sampler assumes the tuple density is about the 
same on all blocks (so that it can sample the blocks directly), but ISTM 
the current block sampler has to make basically the same assumption to 
skip some of the blocks.

Have to read the previous threads on this topic I guess ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Feng Tian
Date:


On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 06/20/2015 08:54 AM, Feng Tian wrote:

While better sample/stats is important for choosing a good plan, in
this query, hash agg is really the right plan. If a sort agg is
chosen, the performance will be really really bad. The patch that
Jeff is working on is critical for a decent TPCH number (unless you
have unlimited amount of memory).

I do agree that Jeff's memory-bounded hashagg patch is very important feature, and in fact we spent a fair amount of time discussing it in Ottawa. So I'm looking forward to getting that soon ;-)

But I don't think hashagg is going to be very good in this particular case. With a 3TB dataset, the query runs out of memory on a machine with 256GB of RAM. So let's assume a complete hash table has ~256GB. With work_mem=1GB that means only ~1/256 of the table can be processed in one batch, so we'll process the first 1/256 of the table, and write out the remaining 99% into batches. Then we'll read the batches one by one, and process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB into batches, and then read those ~2.49TB again. At least that's how I understand Jeff's memory-bounded hashagg proposal.

The sort may perform worse in the general case, but in this case there's an index on the column, and the table is almost perfectly correlated by that column (due to generating the orders one by one, but it seems plausible it'd be the same in reality, assuming the orders are numbered using a sequence). So doing the sort by an indexscan seems rather cheap, and you only need to scan the table once.

regards


--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I have not read Jeff's patch, but here is how I think hash agg should work,

Hash agg scans the lineitem table, performing aggregation in memory.   Once work_mem is exhausted, it writes intermediate state to disk, bucket by bucket.  When the lineitem table is finished, it reads all tuples from one bucket back, combining the intermediate states and finalizing the aggregation.   I saw a quite extensive discussion on combining aggregation on the dev list, so I assume it will be added.

Assuming (modulo an efficient size for I/O) that work_mem is bigger than the square root of the data after aggregation, the above algorithm can finish by writing out and reading back only once.

For TPC-H 3T, the lineitem table has about 20 billion rows and 4 or 5 billion orders.   For the simple subquery, one needs to:
1. scan table, 3TB I/O
2. write out intermediate state, 4 billion * size of (key column + intermediate state ~ 20 bytes) = 80GB 
3. read back 80GB.

If sort is used (also assuming work_mem bigger than the square root of the data), you need to scan the table, write out about 20B * 20 bytes ~ 400GB, and read back 400GB.   Sort may have to do extra rounds of merges, but let's ignore that.

Hash agg has better performance, because:
1. less I/O
2. hash is a linear algorithm, compared to sort at n*lg(n).

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Alvaro Herrera
Date:
Feng Tian wrote:

> I have not read Jeff's patch, but here is how I think hash agg should work,

I think you should discuss that in Jeff's thread, not here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
Hi,

On 06/20/2015 05:29 PM, Feng Tian wrote:
>
> I have not read Jeff's patch, but here is how I think hash agg should work,
>
> Hash agg scan lineitem table, perform aggregation in memory.   Once
> workmem is exhausted, it write intermediate state to disk, bucket by
> bucket.  When lineitem table is finished, it reads all tuples from one
> bucket back, combining intermediate state and finalize the aggregation.
>    I saw a quite extensive discussion on combining aggregation on the
> dev list, so I assume it will be added.

That's not really how the proposed patch works, partly because we don't 
have a good way to serialize/deserialize the aggregate state etc. There 
are also various corner cases where you can end up writing much more data 
than you assumed, but let's discuss that in the thread about the patch, 
not here.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
Hi,

On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>     Hmmm, that's probably true. OTOH correlated columns are not all that
>     uncommon (e.g. table storing time-series data etc.), and this blowup
>     is quite bad ...
>
>
> True, but we don't know how big of a problem the density-skew
> problem might be (since the current algorithm isn't sensitive to it).
> It might be just as big of a problem. Tom mentioned some ancient
> history in the above mentioned thread that made me think the density
> skew was enough of a problem to motivate the current system.
>
>
>     I don't think we need to really assume the density to be constant,
>     maybe we can verify that while collecting the sample? I mean, we're
>     already reading the pages, so we can track the density, and either
>     do the correction or not.
>
>
> Maybe.  I don't know how that would work.

I was thinking about something like
  (1) compute average tuples per page
  (2) generate random sample of 'samplesize' blocks
  (3) for each block, see what is the actual number of tuples on the
      page and perform correction (e.g. if there's only 1/2 the average
      tuples, toss a coin and decide whether to really sample the block)

  (4) repeat until you have sample of the requested size
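
In SQL-ish terms, the correction in (3) might look roughly like this (a 
hypothetical sampled_blocks(blockno, ntuples) working set, purely to 
illustrate the coin toss - the real thing would of course live in 
analyze.c):

   -- accept a block with probability proportional to its density
   -- relative to the average density (capped at 1)
   select blockno
   from sampled_blocks b,
        (select reltuples / relpages::float8 as avg_tuples
           from pg_class where relname = 'lineitem') a
   where random() < least(1, b.ntuples / a.avg_tuples);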

The first weak spot is that this requires a knowledge of reltuples, 
which is something we currently estimate based on the sample, so it 
would have to use the old values I guess.

Also, I'm not sure this works that great with blocks that have higher 
density - you'd like to choose them with a higher probability, but if 
you've already selected them there's no point in repeating that.

> We would have to keep two samples, and dynamically decide which to
> use. And what if the decision is that both density skew is a problem
> and that value clustering is also a problem?

I don't see why we would have to keep two samples? The idea is that 
sampling the blocks truly randomly solves the clustering issue, and the 
correction I mentioned above solves the density skew problem.

> I wonder if the n_distinct could be tweaked so that it counted any
> given value only once for each block it finds it in? So instead of
> asking "how many times was this value sampled", ask "in how many
> blocks was this value sampled at least once"?

I don't think that's a really good idea. Imagine a column with a very 
low cardinality, so that every block you sample has just a very few 
distinct values. That's going to get very nasty very soon, especially 
when combined with estimators assuming a truly random sample (which is 
exactly the issue with the current ndistinct estimator).

>
>     Also, doesn't Vitter make pretty much the same assumption implicitly -
>     otherwise it couldn't skip some of the blocks?
>
>
> Vitter samples an unstructured stream in a single pass, and is
> unbiased.  The explicit block sampling is not part of Vitter, it is
> something we bolted on top of it.

Yeah, you're right of course. Sorry for not being quite precise. We're 
using a variant of the S algorithm, based on the idea that we know the 
number of blocks at the beginning - but that's pretty much the place 
where we assume uniform density of rows per block, no? Because S is 
skipping blocks using that assumption implicitly.

>
> My solution was to just unbolt the block sampling from the top, and let
> it sample the rows (still 300 * stats_target of them) from the whole
> table rather than a random 300 * 100 blocks of the table.  On tables of
> the size I was worried about, block sampling was not very helpful
> anyway.  Reading 30,000 blocks out of 250,000 blocks lead to no
> meaningful IO advantage on my hardware. Any advantage from skipped
> blocks was made up for (and sometimes exceeded) by fouling up the
> read-ahead mechanisms.>
> With 1000 times more blocks, that probably won't work for you.
> But I do wonder, how much time does it take to read a random 1/10,
> 1/100, 1/1000, 1/10000 of a table of your size, just from an IO
> perspective?  How much are we gaining by doing the block sample?

Well, actually I think it would be even more appropriate for very large 
tables. With a 2.5TB table, you don't really care whether analyze 
collects 5GB or 8GB sample, the difference is rather minor compared to 
I/O generated by the other queries etc. The current sample is already 
random enough not to work well with read-ahead, and it scans only a 
slightly lower number of blocks.

And if the larger "random" sample results in better plans (especially 
plans without OOM) ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jim Nasby
Date:
On 6/20/15 12:55 PM, Tomas Vondra wrote:
> Well, actually I think it would be even more appropriate for very large
> tables. With a 2.5TB table, you don't really care whether analyze
> collects 5GB or 8GB sample, the difference is rather minor compared to
> I/O generated by the other queries etc. The current sample is already
> random enough not to work well with read-ahead, and it scans only a
> slightly lower number of blocks.

Have we ever looked at generating new stats as part of a seqscan? I 
don't know how expensive the math is but if it's too much to push to a 
backend perhaps a bgworker could follow behind the seqscan.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jeff Janes
Date:
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 06/20/2015 03:01 AM, Jeff Janes wrote:


    I don't think we need to really assume the density to be constant,
    maybe we can verify that while collecting the sample? I mean, we're
    already reading the pages, so we can track the density, and either
    do the correction or not.


Maybe.  I don't know how that would work.

I was thinking about something like

  (1) compute average tuples per page

  (2) generate random sample of 'samplesize' blocks

  (3) for each block, see what is the actual number of tuples on the
      page and perform correction (e.g. if there's only 1/2 the average
      tuples, toss a coin and decide whether to really sample the block)

  (4) repeat until you have sample of the requested size

Are you producing a block sample or a row sample?  If a block is sampled, then what do you feed it to? 

Are you just picking one row out of each block that survives step 3?  If so, that would be similar to my idea of counting a given value only once per block it was in, except I was thinking of applying that only to n_distinct, not to the entire stats collection process.

...



My solution was to just unbolt the block sampling from the top, and let
it sample the rows (still 300 * stats_target of them) from the whole
table rather than a random 300 * 100 blocks of the table.  On tables of
the size I was worried about, block sampling was not very helpful
anyway.  Reading 30,000 blocks out of 250,000 blocks lead to no
meaningful IO advantage on my hardware. Any advantage from skipped
blocks was made up for (and sometimes exceeded) by fouling up the
read-ahead mechanisms.
>
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10,
1/100, 1/1000, 1/10000 of a table of your size, just from an IO
perspective?  How much are we gaining by doing the block sample?

Well, actually I think it would be even more appropriate for very large tables. With a 2.5TB table, you don't really care whether analyze collects 5GB or 8GB sample, the difference is rather minor compared to I/O generated by the other queries etc.

My answer was to take out the block sampling entirely and read the whole table.  That is what probably won't work for you.  (Come to think of it, I was hesitant to deploy custom code to production, so I never actually deployed that. Instead I cranked up the stats target and let the chips fall where they may)

I think you want to go back to the table to find new blocks to replace ones which were included in the original block sample but ended up having no tuples in the end tuple sample, either because they were low density blocks, or just through the luck of the draw.  But I don't see how that fixes the problem, unless you also prevent a block from contributing more than 1 tuple.  And at that point, I worry about the uneven density again.  If a block has 2 rows selected by pure random chance, I think it would be fine to keep only one (or, go find another random block to pull one row from as a replacement).  But if it has 2 rows selected because it has more rows than average, then we would have to pull the replacement from a random block *of similar density* to avoid swapping one bias for another one.

Cheers,

Jeff

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Jeff Janes
Date:
On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi Tomas,


I've lobotomized the sampling a bit to really produce a random set of blocks first, and that produces way better estimates:

   statistics target     estimate               random
   -----------------------------------------------------------------
   100                     429491 (10000x)   334430766 (14x)
   1000                   4240418  (1000x)   439010499 (10x)

Also, the number of sampled blocks is not that different. With target 100, the current sampler reads ~2900 blocks, while a completely random sampler uses 3000 blocks. So, where's the benefit?

I don't know what you did.  The block sampling is already random, unless there is some overlooked bug, it can't be made more random.  Could you post the patch?

As I understand it, with a target of 100, it should be sampling exactly 30,000 blocks.  Some of those blocks will end up having no rows chosen from them (just by chance), but the blocks were still read.  If a few thousand of those blocks end up with no tuples in the final sample, the motivation for that was not to save IO, it is just an artifact of how the random sampling works.

Thanks,

Jeff

Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:

On 06/22/2015 07:21 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     Hi,
>
>     On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>
>
>              I don't think we need to really assume the density to be
>         constant,
>              maybe we can verify that while collecting the sample? I
>         mean, we're
>              already reading the pages, so we can track the density, and
>         either
>              do the correction or not.
>
>
>         Maybe.  I don't know how that would work.
>
>
>     I was thinking about something like
>
>        (1) compute average tuples per page
>
>        (2) generate random sample of 'samplesize' blocks
>
>        (3) for each block, see what is the actual number of tuples on the
>            page and perform correction (e.g. if there's only 1/2 the average
>            tuples, toss a coin and decide whether to really sample the
>     block)
>
>        (4) repeat until you have sample of the requested size
>
>
> Are you producing a block sample or a row sample?  If a block is
> sampled, then what do you feed it to?

I'm not sure what the question is. The ultimate goal is to get a random 
row sample, but we need to choose which blocks to sample first. So you 
can just generate K random numbers (where K is the requested sample 
size). If you get the block number once, sample one row, if you get the 
block twice, sample two rows, etc.

Of course, this is based on the assumption of uniform tuple density, 
which allows you to choose block first and then independently a row from 
that block. Hence the correction idea I outlined in the previous message.

> Are you just picking one row out of each block that survives step 3?
> If so, that would be similar to my idea of counting a given value
> only once per block it was in, except I was thinking of applying that
> only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general. The 
problem with the current sampling is that it does not produce a random 
sample of rows, but something skewed, which causes serious trouble in 
the estimators processing the sample. I think this 'count only once' 
would result in similar issues (e.g. for low-cardinality columns).

>
> My answer was to take out the block sampling entirely and read the
> whole table. That is what probably won't work for you. (Come to think
> of it, I was hesitant to deploy custom code to production, so I never
> actually deployed that. Instead I cranked up the stats target and let
> the chips fall where they may)

Yeah, reading the whole table is not going to fly I guess. But the 
estimates would be very accurate! ;-)

>
> I think you want to go back to the table to find new blocks to
> replace ones which were included in the original block sample but
> ended up having no tuples in the end tuple sample, either because
> they were low density blocks, or just through the luck of the draw.
> But I don't see how that fixes the problem, unless you also prevent a
> block from contributing more than 1 tuple. And at that point, I worry
> about the uneven density again. If a block has 2 rows selected by
> pure random chance, I think it would be fine to keep only one (or, go
> find another random block to pull one row from as a replacement). But
> if it has 2 rows selected because it has more rows than average, then
> we would have to pull the replacement from a random block *of similar
> density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just 1 tuple per block is a good idea, 
but I have to think about that a bit more.

The basic idea behind density correction is that when you get a block 
with density significantly different from the average (let's say more 
than 10%?), you can either treat it as a partial block (density below 
average), or a collection of multiple blocks (density above average), 
and see whether it should really be sampled.

Handling the 'partial block' case seems rather simple - if the block has 
half the average density, do a random coin toss and either keep or 
discard the tuple.

With high density blocks, it's probably more complicated, and I think it 
really means treating it as multiple 'virtual' blocks, and only keeping 
samples from the first one.

And of course, this needs to be repeated if some of the rows get evicted 
because of the correction mechanism. Which will lead to slightly larger 
samples, but I don't see that as a major problem (and the difference is 
not that big, at least not in the case discussed in this thread previously).

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

From
Tomas Vondra
Date:
Hi,

On 06/22/2015 07:47 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> Hi Tomas,
>
>
>     I've lobotomized the sampling a bit to really produce a random set
>     of blocks first, and that produces way better estimates:
>
>         statistics target     estimate               random
>         -----------------------------------------------------------------
>         100                     429491 (10000x)   334430766 (14x)
>         1000                   4240418  (1000x)   439010499 (10x)
>
>     Also, the number of sampled blocks is not that different. With
>     target 100, the current sampler reads ~2900 blocks, while a
>     completely random sampler uses 3000 blocks. So, where's the benefit?
>
>
> I don't know what you did.  The block sampling is already random, unless
> there is some overlooked bug, it can't be made more random.  Could you
> post the patch?

Attached is an early experimental version of the sampling with density
correction. It breaks the SYSTEM sampling for TABLESAMPLE, because it
changes the signature of one of the BlockSampler methods, but that can be
fixed later. I also suspect it fails with some small tables that would
get sampled completely with the current patch.

The density-correction idea is quite simple, and the sampling works like
this:

   (1) randomly sample the blocks

       Generate random sequence of block numbers, and track how many
       times a particular block was sampled - so if we got block number
       K twice, we know we should sample 2 tuples from that block.

   (2) sample the tuples

       For each of the blocks, sample as many tuples as needed. Also
       note the number of tuples on the block.

   (3) determine maximum density

       Find what the maximum number of tuples per block is.

   (4) resample the tuples

       For each block, resample the tuples using ratio vs. maximum
       density. So if a block has 50% density, each sampled tuple will
       be evicted with ~50% probability.

   (5) replace the evicted tuples (not implemented)

       In step (4) some of the tuples will get evicted from the sample,
       so we should replace them with other tuples. Not yet done, so the
       sample may get smaller than requested.
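
To illustrate step (4) - a rough SQL-ish sketch over hypothetical working
sets sample_rows(blockno, ...) and block_info(blockno, ntuples); the patch
of course does this in C while building the sample:

   -- keep each sampled tuple with probability
   -- (tuples on its block) / (max tuples on any sampled block)
   select s.*
   from sample_rows s
        join block_info b using (blockno),
        (select max(ntuples)::float8 as max_tuples from block_info) m
   where random() < b.ntuples / m.max_tuples;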

And it seems to be working quite fine. I've done various tests with
tables specifically crafted to use different tuple widths for different
values, like for example this:

create table test as
select
     (case when i < 500000 then 1 else 2 end) AS id,
     (case when i < 500000 then 1 else null end)::bigint as val1,
     (case when i < 500000 then 1 else null end)::bigint as val2,
     (case when i < 500000 then 1 else null end)::bigint as val3
from generate_series(1,1000000) s(i);

which creates a table with tuples containing 50% id=1 and 50% id=2, with
the tuples id=1 being much wider (as id=2 have NULLs at the end). And it
works fine - the numbers are pretty much the same as current estimates.

It also significantly improves the ndistinct estimates - for example on
the TPC-H data set, I do get ndistinct=-1 for the l_orderkey column even
with statistics target=100, which is way better than the 10000x
under-estimate produced by the current algorithm.

I do take this as a demonstration that our ndistinct estimator is not
such crap as some claim it to be, but that we're merely feeding it
skewed samples. I still think the adaptive estimator is a neat idea, but
it can't be fixed while keeping the same skewed sampling. That has to be
addressed first ...

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
