Thread: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Hi,

I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow illustrates the importance of the memory-bounded hashagg patch Jeff Davis is working on).

The problem is Q18, particularly this simple subquery:

    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;

which is planned like this:

                                    QUERY PLAN
 ---------------------------------------------------------------------------------
  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
    Group Key: l_orderkey
    Filter: (sum(l_quantity) > '313'::double precision)
    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
 (4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:

    tpch=# select count(distinct l_orderkey) from lineitem;
       count
    ------------
     4500000000
    (1 row)

That's a helluva difference - not the usual one or two orders of magnitude, but a 10000x underestimate.

The usual thing to do in this case is increasing the statistics target, and while this improves the estimate, the improvement is rather small:

    statistics target      estimate     difference
    --------------------------------------------------
    100                      429491          10000
    1000                    4240418           1000
    10000                  42913759            100

I find the pattern rather strange - every time the statistics target increases 10x, the difference decreases 10x - maybe that's natural, but the perfect proportionality is suspicious IMHO.

Also, this is a quite large dataset - the table has ~18 billion rows, and even with target=10000 we're sampling only 3M rows, which is ~0.02%. That's a tiny sample, so inaccuracy is naturally expected, but OTOH the TPC-H dataset is damn uniform - there's pretty much no skew in the distributions AFAIK. So I'd expect a slightly better result.

With target=10000 the plan switches to GroupAggregate, because the estimate gets sufficient to exceed work_mem (2GB). But it's still way off, and it's mostly just a lucky coincidence.
So I'm wondering if there's some bug because of the dataset size (an integer overflow or something like that), so I added a bunch of logging into the estimator, logging all the parameters computed:

    target=100 (samplerows=30000)
    -----------------------------
    WARNING: attnum=1 attname=l_orderkey f1=27976 ndistinct=28977 nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.000000 denom=2024.046627 stadistinct=429491.094029
    WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=429491.09 adaptive=443730.00

    target=1000 (samplerows=300000)
    -------------------------------
    WARNING: attnum=1 attname=l_orderkey f1=279513 ndistinct=289644 nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.000000 denom=20491.658538 stadistinct=4240418.111618
    WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=4240418.11 adaptive=4375171.00

    target=10000 (samplerows=3000000)
    ---------------------------------
    WARNING: attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799 nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.000000 denom=202578.313396 stadistinct=42913759.396282
    WARNING: ndistinct estimate attnum=1 attname=l_orderkey current=42913759.40 adaptive=44449882.00

It's totalrows=18000049031 in all cases. The logs also show the estimate produced by the adaptive estimator (discussed in a separate thread), but apparently that does not change the estimates much :-(

Any ideas?

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
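For reference, the logged numer/denom values above match the Haas & Stokes Duj1 estimator that ANALYZE uses, stadistinct = n*d / (n - f1 + f1*n/N). A quick illustrative sketch reproducing the logged numbers:

```python
# Reproduce the logged n_distinct estimates with the Haas & Stokes Duj1
# estimator used by ANALYZE: n*d / (n - f1 + f1*n/N).
def haas_stokes(n, N, d, f1):
    """n = sample rows, N = total rows, d = distinct values in sample,
    f1 = values seen exactly once in the sample."""
    numer = n * d
    denom = (n - f1) + f1 * n / N
    return numer / denom

# Values taken from the target=100 and target=1000 log lines above.
print(haas_stokes(n=30000, N=18000049031, d=28977, f1=27976))
print(haas_stokes(n=300000, N=18000049031, d=289644, f1=279513))
```

Note that with nearly all sampled values unique (f1 close to n), the denominator is dominated by (n - f1), so even a handful of spurious duplicates in the sample drags the estimate down by orders of magnitude.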
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow illustrates the importance of the memory-bounded hashagg patch Jeff Davis is working on).
The problem is Q18, particularly this simple subquery:
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 313;
which is planned like this:
QUERY PLAN
---------------------------------------------------------------------------------
HashAggregate (cost=598510163.92..598515393.93 rows=418401 width=12)
Group Key: l_orderkey
Filter: (sum(l_quantity) > '313'::double precision)
-> Seq Scan on lineitem (cost=0.00..508509923.28 rows=18000048128 width=12)
(4 rows)
but sadly, in reality the l_orderkey cardinality looks like this:
tpch=# select count(distinct l_orderkey) from lineitem;
count
------------
4500000000
(1 row)
That's a helluva difference - not the usual one or two orders of magnitude, but 10000x underestimate.
Is the row order in the table correlated with the value l_orderkey?
Could you create copy of the table ordered at random, and see if it exhibits the same estimation issue?
Cheers,
Jeff
On 06/19/2015 08:32 PM, Jeff Janes wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>     Hi,
>
>     I'm currently running some tests on a 3TB TPC-H data set, and I
>     tripped over a pretty bad n_distinct underestimate, causing OOM in
>     HashAgg (which somehow illustrates the importance of the
>     memory-bounded hashagg patch Jeff Davis is working on).
>
>     The problem is Q18, particularly this simple subquery:
>
>         select l_orderkey
>         from lineitem
>         group by l_orderkey
>         having sum(l_quantity) > 313;
>
>     which is planned like this:
>
>                                QUERY PLAN
>     ---------------------------------------------------------------------------------
>      HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>        Group Key: l_orderkey
>        Filter: (sum(l_quantity) > '313'::double precision)
>        ->  Seq Scan on lineitem  (cost=0.00..508509923.28
>            rows=18000048128 width=12)
>     (4 rows)
>
>     but sadly, in reality the l_orderkey cardinality looks like this:
>
>         tpch=# select count(distinct l_orderkey) from lineitem;
>            count
>         ------------
>          4500000000
>         (1 row)
>
>     That's a helluva difference - not the usual one or two orders of
>     magnitude, but 10000x underestimate.
>
> Is the row order in the table correlated with the value l_orderkey?

The statistics look like this:

 attnum | attname    | n_distinct  | correlation
--------+------------+-------------+-------------
      1 | l_orderkey | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.

> Could you create copy of the table ordered at random, and see if it
> exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so it's not really easy to do. But I don't really see why that should matter, because the sample is supposed to be random.
But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I get this:

    tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
             from lineitem_sample group by 1) foo group by 1;

     cnt | count
    -----+-------
       1 | 29998
       2 |     1
    (2 rows)

That's quite different compared to what analyze gets, which effectively looks something like this (this is derived from the logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):

     cnt | count
    -----+-------
       1 | 27976
       2 |   976
       3 |    24

Am I wrong or is the sample not that random?

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
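The effect described above can be reproduced with a toy simulation (all table sizes and geometry invented for illustration, not the real TPC-H table): each value spans a few consecutive rows, like l_orderkey with a handful of lineitem rows per order, and a two-stage sample that first restricts itself to a random subset of blocks produces far more duplicate values than a uniform row sample.

```python
import random

random.seed(42)

ROWS_PER_BLOCK = 60           # invented table geometry
N_BLOCKS = 100_000            # ~6M rows in the toy table
N_ROWS = ROWS_PER_BLOCK * N_BLOCKS
SAMPLE = 3_000

def value_of(row):
    # clustered column: each value spans 4 consecutive rows, mimicking
    # l_orderkey with a few lineitem rows per order
    return row // 4

def count_dups(rows):
    vals = [value_of(r) for r in rows]
    return len(vals) - len(set(vals))

# (a) truly uniform row sample over the whole table
uniform = random.sample(range(N_ROWS), SAMPLE)

# (b) two-stage sample: pick SAMPLE random blocks first, then draw the
# rows only from within those blocks (roughly what ANALYZE does)
picked_blocks = {random.randrange(N_BLOCKS) for _ in range(SAMPLE)}
candidate_rows = [b * ROWS_PER_BLOCK + off
                  for b in picked_blocks for off in range(ROWS_PER_BLOCK)]
two_stage = random.sample(candidate_rows, SAMPLE)

print("uniform duplicates:  ", count_dups(uniform))
print("two-stage duplicates:", count_dups(two_stage))
```

With the clustered column, the two-stage sample lands many more pairs of rows in the same block and thus on the same value, which is exactly what confuses a duplicate-sensitive ndistinct estimator.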
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I get this:
tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from lineitem_sample group by 1) foo group by 1;
cnt | count
-----+-------
1 | 29998
2 | 1
(2 rows)
That's quite different compared to what analyze gets, which effectively looks something like this (this is derived from the logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):
cnt | count
-----+-------
1 | 27976
2 | 976
3 | 24
Am I wrong or is the sample not that random?
The sample is not truly random. The two-stage sampling method causes too few blocks to have exactly one row chosen from them, and too many to have either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, then it finds too many duplicates because it too often picks two rows from a single block.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
If we assume all the blocks have the same tuple density, then it is easy to correct this. But without that assumption of constant tuple density, I don't know how to get a truly random sample while still skipping most of the table.
Cheers,
Jeff
On 06/19/2015 09:48 PM, Jeff Janes wrote:
> On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>     But I think you might be on to something, because I manually
>     collected a random sample with 30k rows (by explicitly generating
>     30k random TIDs), and I get this:
>
>     tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
>     from lineitem_sample group by 1) foo group by 1;
>
>      cnt | count
>     -----+-------
>        1 | 29998
>        2 |     1
>     (2 rows)
>
>     That's quite different compared to what analyze gets, which
>     effectively looks something like this (this is derived from the
>     logs, so not perfectly accurate - I only have f1, ndistinct,
>     nmultiple):
>
>      cnt | count
>     -----+-------
>        1 | 27976
>        2 |   976
>        3 |    24
>
>     Am I wrong or is the sample not that random?
>
> The sample is not truly random. The two-stage sampling method causes
> too few blocks to have exactly one row chosen from them, and too many to
> have either 0 or 2+ rows chosen from them.
>
> When values in the same block are likely to be equal, then it finds too
> many duplicates because it too often picks two rows from a single block.

Yeah, I came to the same conclusion after a bit of experimenting. I've logged the block numbers for all the 30k sampled tuples (target=100) and I get these statistics for the number of repetitions:

     cnt | count
    -----+-------
       1 | 11020
       2 |  5637
       3 |  1800
       4 |   450
       5 |    94
       6 |     6

so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 2 tuples sampled etc.

With truly random sampling (just generating 30k random numbers between 0 and 328509442, the number of pages of this particular table), I get this:

    test=# select cnt, count(*) from (select (328509442 * random())::int AS blockno,
             count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;

     cnt | count
    -----+-------
       1 | 29994
       2 |     3

So yeah, not really random.
> See analysis here:
>
> http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com

Thanks.

> If we assume all the blocks have the same tuple density, then it is
> easy to correct this. But without that assumption of constant tuple
> density, I don't know how to get a truly random sample while still
> skipping most of the table.

Hmmm, that's probably true. OTOH correlated columns are not all that uncommon (e.g. table storing time-series data etc.), and this blowup is quite bad ...

I don't think we need to really assume the density to be constant, maybe we can verify that while collecting the sample? I mean, we're already reading the pages, so we can track the density, and either do the correction or not.

Also, doesn't Vitter make pretty much the same assumption implicitly, otherwise it couldn't skip some of the blocks?

regards
Tomas

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
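As a sanity check of the truly-random draw quoted above (30k block numbers drawn from ~328.5M pages), the expected number of colliding pages follows from the birthday problem; a quick illustrative computation:

```python
from math import comb

draws = 30_000
pages = 328_509_442

# Expected number of colliding pairs when drawing page numbers uniformly
# with replacement: number of pairs / number of pages (birthday-problem
# approximation, valid because draws << pages).
expected_pairs = comb(draws, 2) / pages
print(expected_pairs)
```

This comes out around 1.4, so observing 3 pages drawn twice is entirely consistent with a uniform draw, while the 11020/5637/1800/... repetition counts from ANALYZE's sampler are nowhere near that.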
On Fri, Jun 19, 2015 at 1:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 06/19/2015 09:48 PM, Jeff Janes wrote:

On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
But I think you might be on to something, because I manually
collected a random sample with 30k rows (by explicitly generating
30k random TIDs), and I get this:
tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
from lineitem_sample group by 1) foo group by 1;
cnt | count
-----+-------
1 | 29998
2 | 1
(2 rows)
That's quite different compared to what analyze gets, which
effectively looks something like this (this is derived from the
logs, so not perfectly accurate - I only have f1, ndistinct, nmultiple):
cnt | count
-----+-------
1 | 27976
2 | 976
3 | 24
Am I wrong or is the sample not that random?
The sample is not truly random. The two-stage sampling method causes
too few blocks to have exactly one row chosen from them, and too many to
have either 0 or 2+ rows chosen from them.
When values in the same block are likely to be equal, then it finds too
many duplicates because it too often picks two rows from a single block.
Yeah, I came to the same conclusion after a bit of experimenting. I've logged the block numbers for all the 30k sampled tuples (target=100) and I get this statistics for number of repetitions:
cnt | count
-----+-------
1 | 11020
2 | 5637
3 | 1800
4 | 450
5 | 94
6 | 6
so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have 2 tuples sampled etc.
With truly random sampling (just generating 30k random numbers between 0 and 328509442 (number of pages of this particular table), I get this:
test=# select cnt, count(*) from (select (328509442 * random())::int AS blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;
cnt | count
-----+-------
1 | 29994
2 | 3
So yeah, not really random.

See analysis here:
http://www.postgresql.org/message-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
Thanks.

If we assume all the blocks have the same tuple density, then it is
easy to correct this. But without that assumption of constant tuple
density, I don't know how to get a truly random sample while still
skipping most of the table.
Hmmm, that's probably true. OTOH correlated columns are not all that uncommon (e.g. table storing time-series data etc.), and this blowup is quite bad ...
True, but we don't know how big of a problem the density-skew problem might be (since the current algorithm isn't sensitive to it). It might be just as big of a problem. Tom mentioned some ancient history in the above mentioned thread that made me think the density skew was enough of a problem to motivate the current system.
I don't think we need to really assume the density to be constant, maybe we can verify that while collecting the sample? I mean, we're already reading the pages, so we can track the density, and either do the correction or not.
Maybe. I don't know how that would work. We would have to keep two samples, and dynamically decide which to use. And what if the decision is that both density skew is a problem and that value clustering is also a problem?
I wonder if the n_distinct could be tweaked so that it counted any given value only once for each block it finds it in? So instead of asking "how many times was this value sampled", ask "in how many blocks was this value sampled at least once"?
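This per-block counting could be sketched on the sampled (block, value) pairs roughly like this (purely illustrative; the data layout is hypothetical, not ANALYZE's actual structures):

```python
from collections import defaultdict

def per_block_counts(sample):
    """sample: iterable of (blockno, value) pairs from the row sample.
    Instead of counting raw occurrences of each value, count the number
    of distinct blocks in which each value was seen at least once."""
    blocks_seen = defaultdict(set)
    for blockno, value in sample:
        blocks_seen[value].add(blockno)
    return {value: len(blocks) for value, blocks in blocks_seen.items()}

# A value sampled twice from the same block counts once; sampled from
# two different blocks it still counts twice.
sample = [(1, 'a'), (1, 'a'), (2, 'a'), (3, 'b')]
print(per_block_counts(sample))
```

The intra-block duplicates that the two-stage sampler over-produces would then no longer inflate the f1/f2 counts fed to the estimator.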
Also, doesn't Vitter do pretty much the same assumption implicitly, otherwise it couldn't skipping some of the blocks?
Vitter samples an unstructured stream in a single pass, and is unbiased. The explicit block sampling is not part of Vitter, it is something we bolted on top of it.
My solution was to just unbolt the block sampling from the top, and let it sample the rows (still 300 * stats_target of them) from the whole table rather than a random 300 * 100 blocks of the table. On tables of the size I was worried about, block sampling was not very helpful anyway. Reading 30,000 blocks out of 250,000 blocks lead to no meaningful IO advantage on my hardware. Any advantage from skipped blocks was made up for (and sometimes exceeded) by fouling up the read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10, 1/100, 1/1000, 1/10000 of a table of your size, just from an IO perspective? How much are we gaining by doing the block sample?
Cheers,
Jeff
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped
> over a pretty bad n_distinct underestimate, causing OOM in HashAgg
> (which somehow illustrates the importance of the memory-bounded hashagg
> patch Jeff Davis is working on).
>
> [...]
>
> Any ideas?

While better sample/stats is important for choosing a good plan, in this query, hash agg is really the right plan. If a sort agg is chosen, the performance will be really really bad. The patch that Jeff is working on is critical for a decent TPCH number (unless you have unlimited amount of memory).

Thanks,
On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped over
> a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow
> illustrates the importance of the memory-bounded hashagg patch Jeff Davis is
> working on).

Stupid question, but why not just override it using ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)?

I think it's been discussed quite often on previous threads that you need to sample an awful lot of the table to get a good estimate for n_distinct. We could support that, but it would be expensive, and it would have to be done again every time the table is auto-analyzed. The above syntax supports nailing the estimate to either an exact value or a percentage of the table, and I'm not sure why that isn't good enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 06/20/2015 08:54 AM, Feng Tian wrote:
>
> While better sample/stats is important for choosing a good plan, in
> this query, hash agg is really the right plan. If a sort agg is
> chosen, the performance will be really really bad. The patch that
> Jeff is working on is critical for a decent TPCH number (unless you
> have unlimited amount of memory).

I do agree that Jeff's memory-bounded hashagg patch is a very important feature, and in fact we spent a fair amount of time discussing it in Ottawa. So I'm looking forward to getting that soon ;-)

But I don't think hashagg is going to be very good in this particular case. With a 3TB dataset, the query runs out of memory on a machine with 256GB of RAM. So let's assume a complete hash table has ~256GB. With work_mem=1GB that means only ~1/256 of the table can be processed in one batch, so we'll process the first 1/256 of the table, and write out the remaining 99% into batches. Then we'll read the batches one by one, and process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB into batches, and then read those ~2.49TB again. At least that's how I understand Jeff's memory-bounded hashagg proposal.

The sort may perform worse in the general case, but in this case there's an index on the column, and the table is almost perfectly correlated by that column (due to generating the orders one by one, but it seems plausible it'd be the same in reality, assuming the orders are numbered using a sequence). So doing the sort by an indexscan seems rather cheap, and you only need to scan the table once.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to do that at the moment. But I wasn't really sure why we're producing such a poor estimate initially - even considering how small the sample is, it seemed a bit too bad, and the proportionality to the statistics target seemed really strange. Also, if we could produce a better n_distinct estimate, wouldn't that be awesome? I'd much rather not force users to use the manual override if possible.

> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases, but in this case the main problem is that the sample is not as random as needed. It produces way more duplicate values than it should, which then utterly confuses the estimator, which assumes a random sample.

This happens because the column is almost perfectly correlated with the position in the table, so the block-level sampling picks too many pairs of rows from the same block, and those rows tend to carry the same value.

I've lobotomized the sampling a bit to really produce a random set of blocks first, and that produces way better estimates:

    statistics target    estimate (current)       estimate (random blocks)
    -----------------------------------------------------------------
    100                    429491  (10000x)        334430766  (14x)
    1000                  4240418  (1000x)         439010499  (10x)

Also, the number of sampled blocks is not that different. With target 100, the current sampler reads ~2900 blocks, while a completely random sampler uses 3000 blocks. So, where's the benefit? I'm sure there are counter-examples, but that's true for all estimators.

I do realize the random sampler assumes the tuple density is about the same on all blocks (so that it can sample the blocks directly), but ISTM the current block sampler has to make basically the same assumption to skip some of the blocks.

I'll have to read the previous threads on this topic I guess ...

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
On 06/20/2015 08:54 AM, Feng Tian wrote:
While better sample/stats is important for choosing a good plan, in
this query, hash agg is really the right plan. If a sort agg is
chosen, the performance will be really really bad. The patch that
Jeff is working on is critical for a decent TPCH number (unless you
have unlimited amount of memory).
I do agree that Jeff's memory-bounded hashagg patch is very important feature, and in fact we spent a fair amount of time discussing it in Ottawa. So I'm looking forward to getting that soon ;-)
But I don't think hashagg is going to be very good in this particular case. With a 3TB dataset, the query runs out of memory on a machine with 256GB of RAM. So let's assume a complete hash table has ~256GB. With work_mem=1GB that means only ~1/256 of the table can be processed in one batch, so we'll process the first 1/256 of the table, and write out the remaining 99% into batches. Then we'll read the batches one by one, and process those. The table has ~2.5TB, so we'll read 2.5TB, write out ~2.49TB into batches, and then read those ~2.49TB again. At least that's how I understand Jeff's memory-bounded hashagg proposal.
The sort may perform worse in the general case, but in this case there's an index on the column, and the table is almost perfectly correlated by that column (due to generating the orders one by one, but it seems plausible it'd be the same in reality, assuming the orders are numbered using a sequence). So doing the sort by an indexscan seems rather cheap, and you only need to scan the table once.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have not read Jeff's patch, but here is how I think hash agg should work,
Hash agg scans the lineitem table, performing the aggregation in memory. Once work_mem is exhausted, it writes the intermediate states to disk, bucket by bucket. When the lineitem scan is finished, it reads all tuples of one bucket back, combines their intermediate states, and finalizes the aggregation. I saw a quite extensive discussion on combining aggregation on the dev list, so I assume it will be added.

Assuming (modulo an efficient I/O unit size) work_mem is bigger than the square root of the data after aggregation, the above algorithm can finish with a single round of writing out and reading back.
For TPCH 3T, the lineitem table has about 20 billion rows and 4 or 5 billion orders. For the simple subquery, one needs to:
1. scan table, 3TB I/O
2. write out intermediate state, 4 billion * size of (key column + intermediate state ~ 20 bytes) = 80GB
3. read back 80GB.
If sort is used, also assuming work_mem is bigger than the square root of the data, you need to scan the table, write out about 20B * 20 bytes ~ 400GB, and read back 400GB. Sort may have to do extra rounds of merge, but let's ignore that.
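The back-of-the-envelope numbers above can be checked with simple arithmetic (the ~20 bytes per entry for key plus intermediate state is the assumption stated above, not a measured value):

```python
# Rough spill-volume estimates for the TPC-H 3T subquery, following the
# numbers in the message above (20 bytes/entry is an assumption).
BYTES_PER_ENTRY = 20

rows = 20_000_000_000      # ~20 billion lineitem rows
orders = 4_000_000_000     # ~4-5 billion distinct l_orderkey values

# Hash agg spills one entry per *group*: written once, read back once.
hashagg_spill = orders * BYTES_PER_ENTRY
print(f"hashagg spill: {hashagg_spill / 1e9:.0f} GB each way")

# Sort spills one entry per *row*: written once, read back once (at best).
sort_spill = rows * BYTES_PER_ENTRY
print(f"sort spill:    {sort_spill / 1e9:.0f} GB each way")
```

That's the 80GB vs 400GB gap: the hash agg spill is proportional to the number of groups, while the sort spill is proportional to the number of input rows.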
Hash agg has better performance because:
1. less I/O
2. hash is a linear algorithm, compared to sort at n*lg(n).
Feng Tian wrote:

> I have not read Jeff's patch, but here is how I think hash agg should
> work,

I think you should discuss that in Jeff's thread, not here.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 06/20/2015 05:29 PM, Feng Tian wrote:
>
> I have not read Jeff's patch, but here is how I think hash agg should
> work,
>
> Hash agg scan lineitem table, perform aggregation in memory. Once
> workmem is exhausted, it write intermediate state to disk, bucket by
> bucket. When lineitem table is finished, it reads all tuples from one
> bucket back, combining intermediate state and finalize the aggregation.
> I saw a quite extensive discussion on combining aggregation on the
> dev list, so I assume it will be added.

That's not really how the proposed patch works, partly because we don't have a good way to serialize/deserialize the aggregate state etc. There are also various corner cases where you can end up writing much more data than you assumed, but let's discuss that in the thread about the patch, not here.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>     Hmmm, that's probably true. OTOH correlated columns are not all that
>     uncommon (e.g. table storing time-series data etc.), and this blowup
>     is quite bad ...
>
> True, but we don't know how big of a problem the density-skew
> problem might be (since the current algorithm isn't sensitive to it).
> It might be just as big of a problem. Tom mentioned some ancient
> history in the above mentioned thread that made me think the density
> skew was enough of a problem to motivate the current system.
>
>     I don't think we need to really assume the density to be constant,
>     maybe we can verify that while collecting the sample? I mean, we're
>     already reading the pages, so we can track the density, and either
>     do the correction or not.
>
> Maybe. I don't know how that would work.

I was thinking about something like

  (1) compute the average tuples per page
  (2) generate a random sample of 'samplesize' blocks
  (3) for each block, see what the actual number of tuples on the page
      is and perform a correction (e.g. if there's only 1/2 the average
      tuples, toss a coin and decide whether to really sample the block)
  (4) repeat until you have a sample of the requested size

The first weak spot is that this requires knowledge of reltuples, which is something we currently estimate based on the sample, so it would have to use the old values I guess. Also, I'm not sure this works that great with blocks that have a higher density - you'd like to choose them with a higher probability, but if you've already selected them there's no point in repeating that.

> We would have to keep two samples, and dynamically decide which to
> use. And what if the decision is that both density skew is a problem
> and that value clustering is also a problem?

I don't see why we would have to keep two samples? The idea is that sampling the blocks truly randomly solves the clustering issue, and the correction I mentioned above solves the density skew problem.
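Steps (1)-(4) amount to rejection sampling on block density; a toy sketch (block counts and densities are invented for illustration, and avg_tuples is assumed known, e.g. from the previous ANALYZE's reltuples):

```python
import random

def sample_blocks_with_density_correction(n_blocks, tuples_on_block,
                                          avg_tuples, samplesize, rng):
    """Pick random blocks, then accept each with probability proportional
    to its actual tuple density relative to the average: a half-empty
    page survives the coin toss only half the time."""
    chosen = []
    while len(chosen) < samplesize:
        b = rng.randrange(n_blocks)                # step (2): random block
        density = tuples_on_block(b) / avg_tuples  # step (3): actual vs avg
        # Blocks denser than average get capped at probability 1, which
        # is exactly the weak spot mentioned above.
        if rng.random() < min(density, 1.0):
            chosen.append(b)
    return chosen

def tuples_on(b):
    # toy table: blocks 0..99 are half-empty (30 tuples), 100..199 full (60)
    return 30 if b < 100 else 60

rng = random.Random(1)
blocks = sample_blocks_with_density_correction(
    n_blocks=200, tuples_on_block=tuples_on,
    avg_tuples=45, samplesize=5000, rng=rng)
full_share = sum(1 for b in blocks if b >= 100) / len(blocks)
print(full_share)
```

In this toy setup full blocks end up with about 60% of the picks (probability ratio 1.0 vs 30/45), illustrating both the correction and the cap problem for denser-than-average blocks.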
> I wonder if the n_distinct could be tweaked so that it counted any
> given value only once for each block it finds it in? So instead of
> asking "how many times was this value sampled", ask "in how many
> blocks was this value sampled at least once"?

I don't think that's a really good idea. Imagine a column with a very
low cardinality, so that every block you sample has just a very few
distinct values. That's going to get very nasty very soon, especially
when combined with estimators assuming a truly random sample (which is
exactly the issue with the current ndistinct estimator).

>> Also, doesn't Vitter make pretty much the same assumption implicitly,
>> otherwise it couldn't skip some of the blocks?
>
> Vitter samples an unstructured stream in a single pass, and is
> unbiased. The explicit block sampling is not part of Vitter, it is
> something we bolted on top of it.

Yeah, you're right of course. Sorry for not being quite precise. We're
using a variant of the S algorithm, based on the idea that we know the
number of blocks at the beginning - but that's pretty much the place
where we assume uniform density of rows per block, no? Because S is
skipping blocks using that assumption implicitly.

> My solution was to just unbolt the block sampling from the top, and
> let it sample the rows (still 300 * stats_target of them) from the
> whole table rather than a random 300 * 100 blocks of the table. On
> tables of the size I was worried about, block sampling was not very
> helpful anyway. Reading 30,000 blocks out of 250,000 blocks led to no
> meaningful IO advantage on my hardware. Any advantage from skipped
> blocks was made up for (and sometimes exceeded) by fouling up the
> read-ahead mechanisms.
>
> With 1000 times more blocks, that probably won't work for you. But I
> do wonder, how much time does it take to read a random 1/10, 1/100,
> 1/1000, 1/10000 of a table of your size, just from an IO perspective?
> How much are we gaining by doing the block sample?
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether analyze
collects a 5GB or 8GB sample - the difference is rather minor compared
to the I/O generated by the other queries etc. The current sample is
already random enough not to work well with read-ahead, and it scans
only a slightly smaller number of blocks. And if the larger "random"
sample results in better plans (especially plans without OOM) ...

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6/20/15 12:55 PM, Tomas Vondra wrote:
> Well, actually I think it would be even more appropriate for very
> large tables. With a 2.5TB table, you don't really care whether
> analyze collects 5GB or 8GB sample, the difference is rather minor
> compared to I/O generated by the other queries etc. The current
> sample is already random enough not to work well with read-ahead, and
> it scans only a slightly lower number of blocks.

Have we ever looked at generating new stats as part of a seqscan? I
don't know how expensive the math is, but if it's too much to push to a
backend perhaps a bgworker could follow behind the seqscan.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 06/20/2015 03:01 AM, Jeff Janes wrote:
>
>>> I don't think we need to really assume the density to be constant,
>>> maybe we can verify that while collecting the sample? I mean, we're
>>> already reading the pages, so we can track the density, and either
>>> do the correction or not.
>>
>> Maybe. I don't know how that would work.
>
> I was thinking about something like
>
> (1) compute average tuples per page
>
> (2) generate random sample of 'samplesize' blocks
>
> (3) for each block, see what is the actual number of tuples on the
>     page and perform correction (e.g. if there's only 1/2 the average
>     tuples, toss a coin and decide whether to really sample the block)
>
> (4) repeat until you have sample of the requested size
Are you producing a block sample or a row sample? If a block is sampled, then what do you feed it to?
Are you just picking one row out of each block that survives step 3? If so, that would be similar to my idea of counting a given value only once per block it was in, except I was thinking of applying that only to n_distinct, not to the entire stats collection process.
...
>> My solution was to just unbolt the block sampling from the top, and
>> let it sample the rows (still 300 * stats_target of them) from the
>> whole table rather than a random 300 * 100 blocks of the table. On
>> tables of the size I was worried about, block sampling was not very
>> helpful anyway. Reading 30,000 blocks out of 250,000 blocks lead to
>> no meaningful IO advantage on my hardware. Any advantage from
>> skipped blocks was made up for (and sometimes exceeded) by fouling
>> up the read-ahead mechanisms.
>>
>> With 1000 times more blocks, that probably won't work for you. But I
>> do wonder, how much time does it take to read a random 1/10, 1/100,
>> 1/1000, 1/10000 of a table of your size, just from an IO
>> perspective? How much are we gaining by doing the block sample?
>
> Well, actually I think it would be even more appropriate for very
> large tables. With a 2.5TB table, you don't really care whether
> analyze collects 5GB or 8GB sample, the difference is rather minor
> compared to I/O generated by the other queries etc.
My answer was to take out the block sampling entirely and read the whole table. That is what probably won't work for you. (Come to think of it, I was hesitant to deploy custom code to production, so I never actually deployed that. Instead I cranked up the stats target and let the chips fall where they may)
I think you want to go back to the table to find new blocks to replace ones which were included in the original block sample but ended up having no tuples in the final tuple sample, either because they were low-density blocks, or just through the luck of the draw. But I don't see how that fixes the problem, unless you also prevent a block from contributing more than 1 tuple. And at that point, I worry about the uneven density again. If a block has 2 rows selected by pure random chance, I think it would be fine to keep only one (or go find another random block to pull one row from as a replacement). But if it has 2 rows selected because it has more rows than average, then we would have to pull the replacement from a random block *of similar density* to avoid swapping one bias for another.
Cheers,
Jeff
On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi Tomas,

> I've lobotomized the sampling a bit to really produce a random set of
> blocks first, and that produces way better estimates:
>
> statistics target     estimate              random
> -----------------------------------------------------------------
> 100                   429491 (10000x)       334430766 (14x)
> 1000                  4240418 (1000x)       439010499 (10x)
>
> Also, the number of sampled blocks is not that different. With target
> 100, the current sampler reads ~2900 blocks, while a completely
> random sampler uses 3000 blocks. So, where's the benefit?
I don't know what you did. The block sampling is already random; unless there is some overlooked bug, it can't be made more random. Could you post the patch?
As I understand it, with a target of 100, it should be sampling exactly 30,000 blocks. Some of those blocks will end up having no rows chosen from them (just by chance), but the blocks were still read. If a few thousand of those blocks end up with no tuples in the final sample, the motivation for that was not to save IO - it is just an artifact of how the random sampling works.
Thanks,
Jeff
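Jeff's point - that many of the sampled blocks contribute no rows purely
by chance - is easy to simulate. In a toy model where the final rows are
drawn uniformly from the tuples of the already-chosen blocks, a block of
m rows is left untouched with probability roughly (1 - 1/m)^m ≈ e^-1,
i.e. about 37% of the blocks. A scaled-down sketch (block counts and
names are made up for illustration):

```python
import random

def empty_block_fraction(nblocks=3000, rows_per_block=100,
                         sample_size=3000, seed=42):
    """Fraction of chosen blocks contributing no rows to the sample.

    Toy model, scaled down from the target=100 case (30,000 blocks):
    sample `sample_size` rows uniformly from the tuples of `nblocks`
    blocks, then count the blocks that contributed nothing.
    """
    rng = random.Random(seed)
    rows = [(blk, r) for blk in range(nblocks)
                     for r in range(rows_per_block)]
    sample = rng.sample(rows, sample_size)
    touched = {blk for blk, _ in sample}
    return 1 - len(touched) / nblocks
```

Running this gives a fraction close to e^-1 ≈ 0.37, so the "few
thousand" empty blocks are indeed expected behavior, not a bug.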
On 06/22/2015 07:21 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>>>> I don't think we need to really assume the density to be constant,
>>>> maybe we can verify that while collecting the sample? I mean,
>>>> we're already reading the pages, so we can track the density, and
>>>> either do the correction or not.
>>>
>>> Maybe. I don't know how that would work.
>>
>> I was thinking about something like
>>
>> (1) compute average tuples per page
>>
>> (2) generate random sample of 'samplesize' blocks
>>
>> (3) for each block, see what is the actual number of tuples on the
>>     page and perform correction (e.g. if there's only 1/2 the
>>     average tuples, toss a coin and decide whether to really sample
>>     the block)
>>
>> (4) repeat until you have sample of the requested size
>
> Are you producing a block sample or a row sample? If a block is
> sampled, then what do you feed it to?

I'm not sure what the question is. The ultimate goal is to get a random
row sample, but we need to choose which blocks to sample first. So you
can just generate K random numbers (where K is the requested sample
size). If you get the block number once, sample one row; if you get the
block twice, sample two rows; etc.

Of course, this is based on the assumption of uniform tuple density,
which allows you to choose the block first and then independently a row
from that block. Hence the correction idea I outlined in the previous
message.

> Are you just picking one row out of each block that survives step 3?
> If so, that would be similar to my idea of counting a given value
> only once per block it was in, except I was thinking of applying that
> only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general.
The problem with the current sampling is that it does not produce a
random sample of rows, but something skewed, which causes serious
trouble in the estimators processing the sample. I think this 'count
only once' approach would result in similar issues (e.g. for
low-cardinality columns).

> My answer was to take out the block sampling entirely and read the
> whole table. That is what probably won't work for you. (Come to think
> of it, I was hesitant to deploy custom code to production, so I never
> actually deployed that. Instead I cranked up the stats target and let
> the chips fall where they may)

Yeah, reading the whole table is not going to fly, I guess. But the
estimates would be very accurate! ;-)

> I think you want to go back to the table to find new blocks to
> replace ones which were included in the original block sample but
> ended up having no tuples in the end tuple sample, either because
> they were low density blocks, or just through the luck of the draw.
> But I don't see how that fixes the problem, unless you also prevent a
> block from contributing more than 1 tuple. And at that point, I worry
> about the uneven density again. If a block has 2 rows selected by
> pure random chance, I think it would be fine to keep only one (or, go
> find another random block to pull one row from as a replacement). But
> if it has 2 rows selected because it has more rows than average, then
> we would have to pull the replacement from a random block *of similar
> density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just 1 tuple per block is a good
idea, but I have to think about that a bit more.

The basic idea behind the density correction is that when you get a
block with density significantly different from the average (let's say
by more than 10%), you can either treat it as a partial block (density
below average) or as a collection of multiple blocks (density above
average), and correct how it gets sampled.
Handling the 'partial block' case seems rather simple - if the block
has half the average density, do a random coin toss and either keep or
discard the tuple. With high-density blocks it's probably more
complicated, and I think it really means treating the block as multiple
'virtual' blocks, and only keeping samples from the first one.

And of course, this needs to be repeated if some of the rows get
evicted because of the correction mechanism. Which will lead to
slightly larger samples, but I don't see that as a major problem (and
the difference is not that big, at least not in the case discussed in
this thread previously).

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
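The two-stage scheme Tomas describes in this message (generate K random
block numbers, then take as many rows from each block as the number of
times it was drawn) could be sketched as follows - a toy illustration
with made-up names, deliberately built on the uniform-density
assumption that the correction is meant to fix:

```python
import random
from collections import Counter

def two_stage_sample(blocks, k, seed=1):
    """K-row sample via random block numbers drawn with replacement."""
    rng = random.Random(seed)
    # Stage 1: draw K block numbers; a block drawn twice should
    # supply two rows, etc.
    draws = Counter(rng.randrange(len(blocks)) for _ in range(k))

    # Stage 2: sample the requested number of rows from each block.
    # Under the uniform-density assumption every block has enough rows;
    # blocks shorter than expected silently shrink the sample, which is
    # exactly the density-skew problem discussed in this thread.
    sample = []
    for blkno, n in draws.items():
        block = blocks[blkno]
        sample.extend(rng.sample(block, min(n, len(block))))
    return sample
```

With uniform 20-tuple blocks this returns exactly k rows; once block
densities vary, sparse blocks are over-represented relative to the rows
they actually hold.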
Hi,

On 06/22/2015 07:47 AM, Jeff Janes wrote:
> On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> I've lobotomized the sampling a bit to really produce a random set
>> of blocks first, and that produces way better estimates:
>>
>> statistics target     estimate              random
>> -----------------------------------------------------------------
>> 100                   429491 (10000x)       334430766 (14x)
>> 1000                  4240418 (1000x)       439010499 (10x)
>>
>> Also, the number of sampled blocks is not that different. With
>> target 100, the current sampler reads ~2900 blocks, while a
>> completely random sampler uses 3000 blocks. So, where's the benefit?
>
> I don't know what you did. The block sampling is already random;
> unless there is some overlooked bug, it can't be made more random.
> Could you post the patch?

Attached is an early experimental version of the sampling with density
correction. It breaks the SYSTEM sampling for TABLESAMPLE, because it
changes the signature of one of the BlockSampler methods, but that can
be fixed later. I also suspect it fails with some small tables that
would get sampled completely with the current patch.

The density-correction idea is quite simple, and the sampling works
like this:

(1) randomly sample the blocks

    Generate a random sequence of block numbers, and track how many
    times a particular block was sampled - so if we got block number K
    twice, we know we should sample 2 tuples from that block.

(2) sample the tuples

    For each of the blocks, sample as many tuples as needed. Also note
    the number of tuples on the block.

(3) determine maximum density

    Find what the maximum number of tuples per block is.

(4) resample the tuples

    For each block, resample the tuples using the ratio vs. maximum
    density. So if a block has 50% density, each sampled tuple will be
    evicted with ~50% probability.
(5) replace the evicted tuples (not implemented)

    In step (4) some of the tuples will get evicted from the sample, so
    we should replace them with other tuples. Not yet done, so the
    sample may get smaller than requested.

And it seems to be working quite fine. I've done various tests with
tables specifically crafted to use different tuple widths for different
values, like for example this:

create table test as
select (case when i < 500000 then 1 else 2 end) AS id,
       (case when i < 500000 then 1 else null end)::bigint as val1,
       (case when i < 500000 then 1 else null end)::bigint as val2,
       (case when i < 500000 then 1 else null end)::bigint as val3
from generate_series(1,1000000) s(i);

which creates a table with 50% id=1 and 50% id=2 tuples, with the id=1
tuples being much wider (as the id=2 rows have NULLs at the end). And
it works fine - the numbers are pretty much the same as the current
estimates.

It also significantly improves the ndistinct estimates - for example on
the TPC-H data set, I now get ndistinct=-1 for the l_orderkey column
even with statistics target=100, which is way better than the 10000x
underestimate produced by the current algorithm.

I do take this as a demonstration that our ndistinct estimator is not
such crap as some claim it to be, but that we're merely feeding it
skewed samples. I still think the adaptive estimator is a neat idea,
but it can't be fixed while keeping the same skewed sampling. That has
to be addressed first ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
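For illustration, step (4) - the density-based resampling - might look
roughly like this in Python (a sketch of the idea only, with
hypothetical names; the actual experimental patch modifies the
BlockSampler code in C):

```python
import random

def density_resample(sample, block_ntuples, seed=7):
    """Evict sampled tuples from sparse blocks, per step (4) above.

    `sample` is a list of (blkno, tuple) pairs from step (2);
    `block_ntuples` maps blkno -> tuples observed on that block.
    """
    rng = random.Random(seed)
    max_density = max(block_ntuples.values())       # step (3)
    # Keep each tuple with probability density / max_density, i.e. a
    # tuple from a 50%-density block is evicted ~50% of the time.
    # Step (5), replacing the evicted tuples, is omitted here too, so
    # the result may be smaller than the input.
    return [(blk, t) for blk, t in sample
            if rng.random() < block_ntuples[blk] / max_density]
```

When all blocks have equal density the ratio is 1.0 and nothing is
evicted; tuples from a block at one tenth of the maximum density
survive roughly 10% of the time.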