
From Tomas Vondra
Subject Re: Yet another abort-early plan disaster on 9.3
Date
Msg-id 54429CE6.8030005@fuzzy.cz
In response to Re: Yet another abort-early plan disaster on 9.3  (Greg Stark <stark@mit.edu>)
Responses Re: Yet another abort-early plan disaster on 9.3  (Greg Stark <stark@mit.edu>)
List pgsql-performance
On 17.10.2014 19:25, Greg Stark wrote:
> On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> If you know the title of the article, it's usually available
>> elsewhere on the web - either at the university site, or elsewhere.
>> I found these two articles about block-based sampling:
>>
>>
>> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf
>
> There are a series of papers with Chaudhuri as lead author which I
> agree sounds like what Josh is talking about. Note that he's
> Microsoft Research's database group lead and it would be a pretty
> safe bet anything published from there is going to be covered by
> patents from here till next Tuesday (and seventeen years beyond).

Hmmm. I have 0 experience with handling patents and related issues. Any
idea how to address that?

> I think this is all putting the cart before the horse however. If we
> could fix our current sampling to use the data more efficiently that
> would be a good start before we start trying to read even more data.
>
> We currently read just one row from each block on average instead of
> using the whole block. That's what would be needed in the worst case
> if the blocks were a very biased sample (which indeed they probably
> are in most databases due to the way Postgres handles updates). But
> we could at least give users the option to use more than one row per
> block when they know it's ok (such as data that hasn't been updated)
> or detect when it's ok (such as by carefully thinking about how
> Postgres's storage mechanism would bias the data).

I think this will be very tricky, and in fact it may easily make the
estimates much worse, because all the algorithms assume random sampling.

For example, the ndistinct estimator relies on the low-frequency values
(those observed only once or twice in the sample). By taking multiple
rows from each block you significantly distort that frequency for
columns whose values are correlated with the block they are stored in
(which is quite common).
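
For context, the estimator is roughly of the Haas/Stokes form - here is
a rough Python sketch (names are mine, this is not the analyze.c code):

    def estimate_ndistinct(sample_counts, total_rows):
        # sample_counts: value -> number of occurrences in the sample
        # total_rows:    (estimated) number of rows in the table
        n = sum(sample_counts.values())     # sample size
        d = len(sample_counts)              # distinct values in the sample
        f1 = sum(1 for c in sample_counts.values() if c == 1)

        if f1 == n:
            # every sampled value was unique - treat the column as unique
            return total_rows

        # f1 (values seen exactly once) is exactly what block-based
        # sampling distorts for columns correlated with block position
        return (n * d) / ((n - f1) + f1 * n / total_rows)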

Take for example fact tables in data warehouses - those are usually
denormalized and mostly append-only. Say each row has a "date_id"
column, a sequential day number with day 0 somewhere in the past. Each
daily increment is stored on many consecutive blocks, so each block
usually contains a single date_id value.

By sampling all the rows on such a block you gain exactly nothing; in
fact you observe no low-frequency values at all, which makes the
estimator absolutely useless.
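
A quick simulation of exactly this scenario, reusing the
estimate_ndistinct() sketch from above (the table layout is made up, of
course):

    import random
    from collections import Counter

    ROWS_PER_BLOCK, N_BLOCKS, N_DATES = 100, 10_000, 365
    TOTAL_ROWS = ROWS_PER_BLOCK * N_BLOCKS
    SAMPLE_ROWS = 3_000

    # append-only fact table loaded in date order, so each block
    # contains a single date_id
    def date_of_block(b):
        return b * N_DATES // N_BLOCKS

    # (a) random sample of individual rows
    row_sample = Counter(date_of_block(random.randrange(N_BLOCKS))
                         for _ in range(SAMPLE_ROWS))

    # (b) all rows from a random sample of blocks (same sample size)
    block_sample = Counter()
    for b in random.sample(range(N_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK):
        block_sample[date_of_block(b)] += ROWS_PER_BLOCK

    print("true ndistinct       :", N_DATES)
    print("row-sample estimate  :",
          round(estimate_ndistinct(row_sample, TOTAL_ROWS)))
    print("block-sample estimate:",
          round(estimate_ndistinct(block_sample, TOTAL_ROWS)))

The row sample sees most of the 365 values a handful of times each and
lands close to the truth; the block sample sees ~30 values, each of
them exactly 100x, so f1 = 0 and the estimate collapses to the number
of sampled blocks.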

I can imagine fixing this (although I don't see how exactly), but the
thing is we need to fix *all* the estimators we have, not just
ndistinct. And that's going to be tough.

I don't think adding a knob to tune the number of tuples sampled per
block is a good approach. Either we can solve the issues I described
(and in that case it's unnecessary), or we can't solve them and it turns
into a massive foot gun.

> But I looked into this and ran into a problem. I think our algorithm
> for calculating the most frequent values list is bunk. The more rows
> I picked from each block the more biased that list was towards
> values seen earlier in the table. What's worse, when that list was
> biased it threw off the histogram since we exclude the most frequent
> values from the histogram, so all estimates were thrown off.

I think the 'minimal' stats (when we have just '=' for the type) behave
like this, but fixing it by switching to a two-pass approach should not
be that difficult (though it would cost a few more CPU cycles).

Or do you suggest that even the scalar MCV algorithm has this bias
issue? I doubt that, because it builds the MCV list from an array
sorted by number of occurrences, so the position within the table is
irrelevant.
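
In other words, the scalar case boils down to something like this (a
simplified sketch, not the actual compute_scalar_stats() code):

    def scalar_mcv(sample_values, num_mcv):
        # sort the sample and count runs of equal values
        runs = []
        for v in sorted(sample_values):
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        # the MCVs are simply the longest runs - where in the table
        # (or in the sample) a row came from cannot affect the result
        runs.sort(key=lambda r: r[1], reverse=True)
        return [(v, c) for v, c in runs[:num_mcv]]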

> If we could fix the most frequent values collection to not be biased
> when it sees values in a clumpy way then I think we would be okay to
> set the row sample size in Vitter's algorithm to a factor of N
> larger than the block sample size where N is somewhat smaller than
> the average number of rows per block. In fact even if we used all the
> rows in the block I think I've convinced myself that the results
> would be accurate in most circumstances.

I don't expect fixing the MCV to be overly difficult (although it will
need a few more CPU cycles).

But making it work with block sampling will be much harder, because of
the bias. The phrase 'in most circumstances' doesn't sound very
convincing to me ...

> I think to calculate the most frequent values more accurately it
> would take a two pass approach. Scan some random sample of blocks
> with a counting bloom filter then do a second pass (possibly for the
> same sample?) keeping counts only for values that the counting bloom
> filter said hashed to the most common hash values. That might not be
> exactly the most common values but should be at least a
> representative sample of the most common values.

I don't see why the counting bloom filter would be necessary in a
two-pass approach, though.
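
Within the sampled blocks a plain hash table already gives exact
counts, with memory bounded by the number of distinct values in the
sample. Roughly (read_block() standing in for whatever fetches the
column values from one sampled block):

    from collections import Counter

    def sample_mcv(read_block, sampled_blocks, num_mcv):
        # pass 1: exact per-value counts over the sampled blocks
        counts = Counter()
        for b in sampled_blocks:
            counts.update(read_block(b))
        # pass 2 is then just picking the top entries
        return counts.most_common(num_mcv)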

regards
Tomas

