Thread: estimating # of distinct values

estimating # of distinct values

From
Tomas Vondra
Date:
Hi everyone,

About two weeks ago I started a thread about cross-column stats. One
of the proposed solutions is based on the number of distinct values, and the
truth is the current distinct estimator has some problems.

I've done a bit of research - I've read about 20 papers on this - and
I'd like to present a short summary here, along with possible solutions.

The simple truth is

1) sampling-based estimators are a dead-end
2) there are very interesting stream-based estimators
3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki
(http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the
most interesting papers etc., so here is just a very short explanation.

1) sampling-based estimators are a dead-end
  The paper from Charikar & Chaudhuri states (and proves) that for each
  sampling-based estimator there is a data set where the estimator
  produces an arbitrarily large error (with non-negligible probability).
  So replacing one sampling-based estimator with another generally does
  not improve the situation - it fixes one dataset and breaks another.

  The error is directly related to the size of the sample, so the truth
  is that to get a good distinct estimate you need to scan a significant
  portion of the table (almost all of it).

  So estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators
  A very interesting idea comes from the field of data streams. These
  estimators are based on a single pass through the data, followed by
  incremental updates. A very nice thing is that these algorithms don't
  need that much space, as they don't store the actual values.

  The idea behind this is similar to the Bloom filter, i.e. set bits of
  a bitmap using a hash of the value. But in this case it's not required
  to answer questions like 'is this an element of the set X?' so it's
  actually even more space efficient. For an introduction see the first
  paper, published in 1985 by Flajolet
  (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).

  One of the recent articles (published in June 2010) actually presents
  an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1)
  update complexity. The "e" means precision, i.e. the estimate will be
  within [(1-e)F, (1+e)F] where F is the real value.

  The paper on the self-learning bitmap states that to track 10^9
  distinct values with 1% relative error you need about 61 kilobits of
  space (which is absolutely awesome). The optimal algorithm needs a bit
  more space, I think.

  A very interesting solution is "distinct sampling", which somehow
  extends Wegman's adaptive sampling approach.
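
  Just to illustrate the bitmap-plus-hash idea, below is a minimal
  sketch of the original Flajolet-Martin probabilistic counting (one
  bitmap only, a made-up hash function, no stochastic averaging - so
  this is a toy, far less accurate than the PCSA / self-learning bitmap
  estimators from the papers):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define FM_BITS 64
#define FM_PHI  0.77351        /* correction constant from the 1985 paper */

/* a made-up 64-bit mixing hash (splitmix64-style finalizer) - any
   reasonably uniform hash function would do here */
static uint64_t
hash_u64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* rho = position of the least significant 1-bit of the hash */
static int
rho(uint64_t h)
{
    int i = 0;

    while (i < FM_BITS - 1 && !(h & (1ULL << i)))
        i++;
    return i;
}

typedef struct { uint64_t bitmap; } fm_sketch;

/* a single pass over the data: O(1) work per row, no values stored */
static void
fm_add(fm_sketch *s, uint64_t value)
{
    s->bitmap |= 1ULL << rho(hash_u64(value));
}

/* R = lowest bit position still zero, estimate = 2^R / phi */
static double
fm_estimate(const fm_sketch *s)
{
    int r = 0;

    while (r < FM_BITS && (s->bitmap & (1ULL << r)))
        r++;
    return ldexp(1.0, r) / FM_PHI;
}

int
main(void)
{
    fm_sketch   s = {0};

    /* 100,000 rows, but only 1,000 distinct values */
    for (uint64_t i = 0; i < 100000; i++)
        fm_add(&s, i % 1000);
    printf("estimated ndistinct: %.0f\n", fm_estimate(&s));
    return 0;
}

  A single bitmap like this is only good to within roughly a factor of
  two - the real estimators average many such bitmaps (PCSA) or adapt
  the bitmap (self-learning bitmap) to get the error down to the few
  percent mentioned above.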

3) the stream-based estimators have some disadvantages
  Not everything is perfect, though - the most serious disadvantage is
  that these algorithms (usually) don't handle removal of elements.
  Inserts are easy to handle, but deletes are hard (just as in the case
  of Bloom filters).

  So using this stream-based approach might lead to degradation in the
  case of heavily updated tables :-(

  I see two possible solutions here:

  (a) use counters instead of bits, and track actual counts - but this
      is tricky, especially with 'probabilistic' algorithms, and needs
      much more space (but still, 32x 61 kbits is just about 250kB);
      there is a rough sketch of this below, after (b)
 
  (b) count how many deletes/updates were performed, and if needed
      rebuild the whole bitmap
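
  To sketch what option (a) could mean for the simple bitmap estimator
  above - purely illustrative, the names are made up, and doing the same
  for the more sophisticated estimators is exactly the tricky part:

#include <stdint.h>

/* the same made-up hash and rho() as in the sketch above, repeated so
   this fragment stands on its own */
static uint64_t
hash_u64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static int
rho(uint64_t h)
{
    int i = 0;

    while (i < 63 && !(h & (1ULL << i)))
        i++;
    return i;
}

/* one 32-bit counter per bit of the bitmap - this is the 32x above */
typedef struct { uint32_t counts[64]; } fm_counter_sketch;

static void
fmc_insert(fm_counter_sketch *s, uint64_t value)
{
    s->counts[rho(hash_u64(value))]++;
}

/* deletes simply decrement - something the plain bitmap cannot do */
static void
fmc_delete(fm_counter_sketch *s, uint64_t value)
{
    int     b = rho(hash_u64(value));

    if (s->counts[b] > 0)
        s->counts[b]--;
}

/* the estimator itself only needs to know which positions are non-zero */
static uint64_t
fmc_bitmap(const fm_counter_sketch *s)
{
    uint64_t bitmap = 0;

    for (int i = 0; i < 64; i++)
        if (s->counts[i] > 0)
            bitmap |= 1ULL << i;
    return bitmap;
}
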
  But even with these disadvantages, there really is no other way to
  enhance the estimates. I don't think this should be the default
  behavior - just as in the case of cross-column stats, this should be
  optional, for when the current estimator does not work well.
 

regards
Tomas


Re: estimating # of distinct values

From
Robert Haas
Date:
2010/12/27 Tomas Vondra <tv@fuzzy.cz>:
>   But even with these disadvantages, there really is no other way to
>   enhance the estimates. I don't think this should be the default
>   behavior - just as in the case of cross-column stats, this should be
>   optional, for when the current estimator does not work well.

This is going to be a lot of work to implement, so before you do it,
we should try to reach a consensus that (a) it's part of an overall
strategy that the community generally supports and (b) we have
consensus on the design for this part.

With respect to (a), I have to admit I've found the discussion on
cross-column stats to be quite difficult to follow.  I'd like to see a
very simple description of exactly what information we're going to
store, under what circumstances we'll store it, and how we'll use it
to compute selectivity estimates.

With respect to (b), I think I'd need to see a much more detailed
design for how you intend to make this work.  Off the top of my head
there seems to be some pretty serious feasibility problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> With respect to (b), I think I'd need to see a much more detailed
> design for how you intend to make this work.  Off the top of my
> head there seems to be some pretty serious feasibility problems.
I had one random thought on that -- it seemed like a large concern
was that there would need to be at least an occasional scan of the
entire table to rebuild the distinct value information.  Don't we
already require an occasional scan of the entire table for freezing
transaction IDs?  Could this be part of any vacuum of the entire
table?
-Kevin


Re: estimating # of distinct values

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> With respect to (b), I think I'd need to see a much more detailed
>> design for how you intend to make this work.  Off the top of my
>> head there seems to be some pretty serious feasibility problems.
> I had one random thought on that -- it seemed like a large concern
> was that there would need to be at least an occasional scan of the
> entire table to rebuild the distinct value information.  Don't we
> already require an occasional scan of the entire table for freezing
> transaction IDs?  Could this be part of any vacuum of the entire
> table?

Well, first, those scans occur only once every few hundred million
transactions, which is not likely a suitable timescale for maintaining
statistics.  And second, we keep on having discussions about rejiggering
the whole tuple-freezing strategy.  Even if piggybacking on those scans
looked useful, it'd be unwise to assume it'll continue to work the same
way it does now.
        regards, tom lane


Re: estimating # of distinct values

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, first, those scans occur only once every few hundred million
> transactions, which is not likely a suitable timescale for
> maintaining statistics.
I was assuming that the pass of the entire table was priming for the
incremental updates described at the start of this thread.  I'm not
clear on how often the base needs to be updated for the incremental
updates to keep the numbers "close enough".
> And second, we keep on having discussions about rejiggering
> the whole tuple-freezing strategy.  Even if piggybacking on those
> scans looked useful, it'd be unwise to assume it'll continue to
> work the same way it does now.
Sure, it might need to trigger its own scan in the face of heavy
deletes anyway, since the original post points out that the
algorithm handles inserts better than deletes, but as long as we
currently have some sequential pass of the data, it seemed sane to
piggyback on it when possible.  And maybe we should be considering
things like this when we weigh the pros and cons of rejiggering. 
This issue of correlated values comes up pretty often....
-Kevin


Re: estimating # of distinct values

From
Tomas Vondra
Date:
On 27.12.2010 22:46, Robert Haas wrote:
> 2010/12/27 Tomas Vondra <tv@fuzzy.cz>:
>>   But even with these disadvantages, there really is no other way to
>>   enhance the estimates. I don't think this should be the default
>>   behavior - just as in the case of cross-column stats, this should be
>>   optional, for when the current estimator does not work well.
> 
> This is going to be a lot of work to implement, so before you do it,
> we should try to reach a consensus that (a) it's part of an overall
> strategy that the community generally supports and (b) we have
> consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a
consensus of the whole community prior to building something that works.
I plan to build a very simple prototype soon, so let's talk about this
later.

I've started these two threads mainly as a 'brainstorming' - to present
what I've learned from the various papers, propose a possible solution,
highlight issues I see, and get ideas on how to solve them.

> With respect to (a), I have to admit I've found the discussion on
> cross-column stats to be quite difficult to follow.  I'd like to see a
> very simple description of exactly what information we're going to
> store, under what circumstances we'll store it, and how we'll use it
> to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I
created a 'summary page' on wiki
  http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:
 (a) multi-dimensional histograms - this can be done in about the same
     way we create single-dimensional histograms, so that's not a big
     problem (although more data might be necessary to get accurate
     histograms)

 (b) # of distinct values - this is the tricky part, as described in
     this thread

> With respect to (b), I think I'd need to see a much more detailed
> design for how you intend to make this work.  Off the top of my head
> there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is
something like this:

(a) Build a simple 'contrib-like' module that allows manual collection
    of stats and computation of estimates (not integrated with the
    optimizer in any way).

    This should help us better understand what stats we need etc.

(b) Minimal integration with the core - still manual collection of
    stats, but the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless
you want to use the new stats.

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
On 28.12.2010 00:04, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  
>> Well, first, those scans occur only once every few hundred million
>> transactions, which is not likely a suitable timescale for
>> maintaining statistics.
>  
> I was assuming that the pass of the entire table was priming for the
> incremental updates described at the start of this thread.  I'm not
> clear on how often the base needs to be updated for the incremental
> updates to keep the numbers "close enough".

Well, that really depends on the workload. If you never remove all
occurrences of a given value (e.g. a ZIP code), you don't need to rescan
the table at all.

All you need is to build the stats once and then update them
incrementally - and I believe this could be handled by autovacuum.

>> And second, we keep on having discussions about rejiggering
>> the whole tuple-freezing strategy.  Even if piggybacking on those
>> scans looked useful, it'd be unwise to assume it'll continue to
>> work the same way it does now.
>  
> Sure, it might need to trigger its own scan in the face of heavy
> deletes anyway, since the original post points out that the
> algorithm handles inserts better than deletes, but as long as we
> currently have some sequential pass of the data, it seemed sane to
> piggyback on it when possible.  And maybe we should be considering
> things like this when we weigh the pros and cons of rejiggering. 
> This issue of correlated values comes up pretty often....

Again, there are two types of stats - one of them needs to scan the
whole table (estimate of distinct values), the other one does not
(multi-dimensional histograms).

These two cases are independent - you don't necessarily need both.

Better ndistinct estimates would actually solve some of the current
issues, it's not just a matter of cross-column stats.

regards
Tomas


Re: estimating # of distinct values

From
Josh Berkus
Date:
> The simple truth is
> 
> 1) sampling-based estimators are a dead-end

While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument.  It requires us
to accept that sample-based estimates cannot ever be made to work,
simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators.  Further, 2000 was hardly the end of sampling-estimation
paper publication; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*.  The accuracy statistics for block-based samples of 5% of
the table look quite good.

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample.  But there's a huge difference between 5%
or 10% and "a majority of the table".

Again, don't let this discourage you from attempting to write a
stream-based estimator.  But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]

--
  -- Josh Berkus
     PostgreSQL Experts Inc.
     http://www.pgexperts.com
 


Re: estimating # of distinct values

From
Robert Haas
Date:
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus <josh@agliodbs.com> wrote:
> While I don't want to discourage you from working on stream-based
> estimators ... I'd love to see you implement a proof-of-concept for
> PostgreSQL, and test it ... the above is a non-argument.  It requires us
> to accept that sample-based estimates cannot ever be made to work,
> simply because you say so.

This argument has been made on this mailing list and on
pgsql-performance many times, so I have been assuming that this is
settled mathematics.  Admittedly, I don't have a citation for this...

> I would agree that it's impossible to get a decent estimate of
> n-distinct from a 1% sample.  But there's a huge difference between 5%
> or 10% and "a majority of the table".

Considering the way that disks work, it's not that huge.  If the table
is small enough to fit in memory conveniently, it probably wouldn't be
unacceptably costly even to read the whole thing.  But if it's a
terabyte on disk, reading every tenth block is probably going to carry
most of the I/O cost of reading every block.  Even reading every
twentieth or hundredth block is going to be significantly more
expensive than what we do now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
tv@fuzzy.cz
Date:
>
>> The simple truth is
>>
>> 1) sampling-based estimators are a dead-end
>
> The Charikar and Chaudhuri paper does not, in fact, say that it is
> impossible to improve sampling-based estimators as you claim it does. In
> fact, the authors offer several ways to improve sampling-based
> estimators.  Further, 2000 was hardly the end of sampling-estimation
> paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error
of a sampling-based estimator, depending on the sample size. The actual
inequality is

  error(d) >= sqrt( (n-r)/(2r) * 1/q )

where n is the number of rows in the table, r is the number of sampled
rows, and error(d) is the "ratio error" (d - estimated number of distinct
values, D - actual number of distinct values)

  error(d) = max{ D/d, d/D }

All this holds with probability at least q, which you can choose (as long
as it's at least e^{-r}, i.e. e^{-1000} in the example below).

Say you have a table with 1,000,000 rows and you use a sample of 1,000
rows to do the estimate. In that case you get

  error(d) >= 99     with   q = 0.05
  error(d) >= 70     with   q = 0.1
  error(d) >= 31     with   q = 0.5

If you can sample 10% of the table, you get this

  error(d) >= 9.5    with   q = 0.05
  error(d) >= 6.7    with   q = 0.1
  error(d) >= 3      with   q = 0.5

So even with 10% of the table, there's a 10% probability to get an
estimate that's 7x overestimated or underestimated. With lower probability
the interval is much wider.
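
Just to make those numbers easy to reproduce - a trivial standalone
program (nothing PostgreSQL-specific) that plugs the sample sizes into
the inequality quoted above; the figures above are these values, rounded:

#include <stdio.h>
#include <math.h>

/* lower bound on the ratio error for a sample of r rows out of n,
   holding with probability q (the inequality quoted above) */
static double
error_bound(double n, double r, double q)
{
    return sqrt((n - r) / (2.0 * r) * (1.0 / q));
}

int
main(void)
{
    double      n = 1000000.0;
    double      q[3] = {0.05, 0.1, 0.5};
    int         i;

    for (i = 0; i < 3; i++)
        printf("sample of 1,000 rows: q = %.2f  error(d) >= %.1f\n",
               q[i], error_bound(n, 1000.0, q[i]));
    for (i = 0; i < 3; i++)
        printf("sample of 10%%:        q = %.2f  error(d) >= %.1f\n",
               q[i], error_bound(n, 0.1 * n, q[i]));
    return 0;
}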

> For example, I still think we could tremendously improve our current
> sampling-based estimator without increasing I/O by moving to block-based
> estimation*.  The accuracy statistics for block-based samples of 5% of
> the table look quite good.

Well, that's certainly possible. But at best you can achieve the error(d)
lower bound consistently (for all datasets) - you can't do better. And
they've already presented an estimator that does exactly this (called AE -
Adaptive Estimator in the paper).

> I would agree that it's impossible to get a decent estimate of
> n-distinct from a 1% sample.  But there's a huge difference between 5%
> or 10% and "a majority of the table".

Sure we can do better. But there's a limit we can't cross no matter what
estimator we choose and how large the sample will be.

I've seen several post-2000 papers on sample-based estimators, but those
are mostly hybrid estimators, i.e. estimators composed of several simple
estimators. So in the end it's pretty complicated, you need to gather a
lot of different stats, and still you can't get better error than the
lower bound :-(

So yes, we can improve the current estimator (making it more complex), but
it does not actually solve the problem.

> Again, don't let this discourage you from attempting to write a
> stream-based estimator.  But do realize that you'll need to *prove* its
> superiority, head-to-head, against sampling-based estimators.
>
> [* http://www.jstor.org/pss/1391058 (unfortunately, no longer
> public-access)]

It's available here http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and
Chaudhuri paper - it's still 'just' a sample-based estimator, although
they draw the sample at the block level. But yes, that's a good idea, and
I think I've already mentioned it in the cross-column stats thread.
already mentioned it in the cross-column stats thread I think.

The question is whether a sample obtained this way will be as good as the
current samples. You could get quite separate 'clusters' of values, one
cluster for each block, even though in reality the values are uniformly
distributed. And I guess the resulting histogram would be crippled by
this too.

But if you know about interesting papers on sample-based estimators
(especially post-2000), let me know. I've searched for them, but those I
found were not very interesting IMHO.

Tomas



Re: estimating # of distinct values

From
"Kevin Grittner"
Date:
<tv@fuzzy.cz> wrote:
> So even with 10% of the table, there's a 10% probability to get an
> estimate that's 7x overestimated or underestimated. With lower
> probability the interval is much wider.
Hmmm...  Currently I generally feel I'm doing OK when the estimated
rows for a step are in the right order of magnitude -- a 7% error
would be a big improvement in most cases.  Let's not lose track of
the fact that these estimates are useful even when they are not
dead-on accurate.
-Kevin


Re: estimating # of distinct values

From
tv@fuzzy.cz
Date:
> <tv@fuzzy.cz> wrote:
>
>> So even with 10% of the table, there's a 10% probability to get an
>> estimate that's 7x overestimated or underestimated. With lower
>> probability the interval is much wider.
>
> Hmmm...  Currently I generally feel I'm doing OK when the estimated
> rows for a step are in the right order of magnitude -- a 7% error
> would be a big improvement in most cases.  Let's not lose track of
> the fact that these estimates are useful even when they are not
> dead-on accurate.

Well, but that's not 7%, that's 7x! And the theorem says 'greater or
equal', so this is actually the minimum - you can get a much bigger
difference with lower probability. So you can easily get an estimate that
is a few orders of magnitude off.

Anyway, I really don't want precise values, just a reasonable estimate.
As I said, we could use the AE estimator they proposed in the paper. It
has the nice feature that it actually reaches the lower bound (thus the
inequality changes to equality). The downside is that there are estimators
with better behavior on some datasets.

Tomas



Re: estimating # of distinct values

From
Josh Berkus
Date:
> Well, but that's not 7%, that's 7x! And the theorem says 'greater or
> equal', so this is actually the minimum - you can get a much bigger
> difference with lower probability. So you can easily get an estimate
> that is a few orders of magnitude off.

FWIW, based on query performance, estimates which are up to 5X off are
tolerable, and anything within 3X is considered "accurate".  Above 5X
the probability of bad query plans becomes problematically high.

Of course, if you're doing cross-column stats, the accuracy of each
individual column becomes critical, since the estimation errors could be
multiplicative in the worst case (i.e. if colA is 3X off and colB is 0.3X
then colA<->colB will be 9X off).

Anyway, I look forward to your experiments with stream-based estimators.

--
  -- Josh Berkus
     PostgreSQL Experts Inc.
     http://www.pgexperts.com
 


Re: estimating # of distinct values

From
Florian Pflug
Date:
On Dec 27, 2010, at 23:49, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> With respect to (b), I think I'd need to see a much more detailed
>> design for how you intend to make this work.  Off the top of my
>> head there seems to be some pretty serious feasibility problems.
> 
> I had one random thought on that -- it seemed like a large concern
> was that there would need to be at least an occasional scan of the
> entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
actually contains an algorithm which *does* handle deletes - it's called
"L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan a range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise or.
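
To make that a bit more concrete, here is a rough, self-contained sketch
(K = 1 hash function, a made-up hash, a gcc-style popcount builtin, and
arbitrary sizes - an illustration of the scheme, not a concrete proposal):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define NBITS   (64 * 1024)            /* bits per chunk bit field */
#define NWORDS  (NBITS / 64)

typedef struct { uint64_t w[NWORDS]; } bitfield;

/* a made-up mixing hash, standing in for H_i(v) */
static uint64_t
hash_u64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* set K = 1 bit for each value seen in this chunk of B pages */
static void
bf_add(bitfield *f, uint64_t value)
{
    uint64_t    bit = hash_u64(value) % NBITS;

    f->w[bit / 64] |= 1ULL << (bit % 64);
}

/* OR-ing two chunk fields corresponds to the union of their value sets */
static void
bf_union(bitfield *dst, const bitfield *src)
{
    for (int i = 0; i < NWORDS; i++)
        dst->w[i] |= src->w[i];
}

/* invert E[ones] ~ N * (1 - e^(-D/N)), i.e. plain "linear counting" */
static double
bf_estimate(const bitfield *f)
{
    long    ones = 0;

    for (int i = 0; i < NWORDS; i++)
        ones += __builtin_popcountll(f->w[i]);
    if (ones == NBITS)
        return (double) NBITS;         /* saturated - field is too small */
    return -(double) NBITS * log(1.0 - (double) ones / NBITS);
}

int
main(void)
{
    bitfield    a, b;

    memset(&a, 0, sizeof(a));
    memset(&b, 0, sizeof(b));

    for (uint64_t v = 0; v < 5000; v++)        /* values in chunk 1 */
        bf_add(&a, v);
    for (uint64_t v = 3000; v < 8000; v++)     /* values in chunk 2 */
        bf_add(&b, v);

    bf_union(&a, &b);
    printf("estimated ndistinct: %.0f (actual 8000)\n", bf_estimate(&a));
    return 0;
}

With K > 1 hash functions you'd invert the corresponding expectation
instead; the OR in bf_union() is what makes the per-chunk fields
combinable into a global estimate of D.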

Since the need to regularly VACUUM tables hit by updates or deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around
1000*<size of bitfield>/<page size>. If the bit field needs ~100kB, that'd
make B ~= 12000 pages ~= 100MB.

best regards,
Florian Pflug



Re: estimating # of distinct values

From
Tomas Vondra
Date:
On 30.12.2010 15:43, Florian Pflug wrote:
> On Dec 27, 2010, at 23:49, Kevin Grittner wrote:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>> With respect to (b), I think I'd need to see a much more detailed
>>> design for how you intend to make this work.  Off the top of my
>>> head there seems to be some pretty serious feasibility problems.
>>
>> I had one random thought on that -- it seemed like a large concern
>> was that there would need to be at least an occasional scan of the
>> entire table to rebuild the distinct value information.
> 
> I believe we could actually avoid that.
> 
> First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
> actually contains an algorithm which *does* handle deletes - it's called
> "L_0" estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation,
but that's about estimating the Hamming norm of a vector - so I ignored
it, thinking we can't use it to estimate the number of distinct values.
But if it really handles deletions and if we can use it, then it's
really interesting.

> Second, as Tomas pointed out, the stream-based estimator is essentially a
> simplified version of a bloom filter. It starts out with a field of
> N zero bits, and sets K of them to 1 for each value v in the stream.
> Which bits are set to 1 depends on some hash function(s) H_i(v). It's
> then easy to compute how many 1-bits you'd expect to find in the bit
> field after seeing D distinct values, and by reversing that you can
> estimate D from the number of 1-bits in the bit field.

No, I haven't said the stream-based estimators are simplified versions
of a Bloom filter. I said the approach is very similar - all the
algorithms use bitmaps and hash functions, but the algorithms (Bloom
filter vs. probabilistic counting and adaptive sampling) are very different.

The Bloom filter is much more straightforward. The other algorithms are
much more sophisticated, which allows them to use less space.

> To avoid having to rescan large tables, instead of storing one such
> bit field, we'd store one per B pages of data. We'd then only need
> to scan a range of B pages around every updated or deleted tuple,
> and could afterwards compute a new global estimate of D by combining
> the individual bit fields with bitwise or.

I don't think this could help.

1) This works just with the Bloom filters, not with the other
   algorithms (you can't combine the segments using bitmap OR).

2) With heavily modified tables the updates are usually 'spread'
   through the whole table, so you'll have to rebuild all the
   segments anyway.
 

> Since the need to regularly VACUUM tables hit by updates or deletes
> won't go away any time soon, we could piggy-back the bit field
> rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to
an actual seq scan if the amount of changes reaches some threshold
(since the last rebuild). Only if the amount of changes reaches a higher
threshold would we rebuild the stats on our own.

Something like

IF (# of updates + deletes > 5%) THEN wait for a seq scan
IF (# of updates + deletes > 10%) THEN rebuild the stats

I've found a nice ppt describing how Oracle does this:
  http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt

and there's PDF version
  http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach
(see slide 26). And once they find out there were too many changes in the
table, they rebuild the whole thing.

I'm not saying we should do exactly the same, just that this might be a
good direction.

regards
Tomas


Re: estimating # of distinct values

From
Alvaro Herrera
Date:
Excerpts from Tomas Vondra's message of Thu Dec 30 16:38:03 -0300 2010:

> > Since the need to regularly VACUUM tables hit by updates or deletes
> > won't go away any time soon, we could piggy-back the bit field
> > rebuilding onto VACUUM to avoid a second scan.
> 
> Well, I guess it's a bit more complicated. First of all, there's a local
> VACUUM when doing HOT updates. Second, you need to handle inserts too
> (what if the table just grows?).
> 
> But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
> place to handle rebuilds of distinct stats.

I was thinking that we could have two different ANALYZE modes, one
"full" and one "incremental"; autovacuum could be modified to use one or
the other depending on how many changes there are (of course, the user
could request one or the other, too; not sure what should be the default
behavior).  So the incremental one wouldn't worry about deletes, only
inserts, and could be called very frequently.  The other one would
trigger a full table scan (or nearly so) to produce a better estimate in
the face of many deletions.

I haven't followed this discussion closely so I'm not sure that this
would be workable.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: estimating # of distinct values

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I was thinking that we could have two different ANALYZE modes, one
> "full" and one "incremental"; autovacuum could be modified to use one or
> the other depending on how many changes there are (of course, the user
> could request one or the other, too; not sure what should be the default
> behavior).

How is an incremental ANALYZE going to work at all?  It has no way to
find out the recent changes in the table, for *either* inserts or
deletes.  Unless you want to seqscan the whole table looking for tuples
with xmin later than something-or-other ... which more or less defeats
the purpose.
        regards, tom lane


Re: estimating # of distinct values

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I was thinking that we could have two different ANALYZE modes, one
> > "full" and one "incremental"; autovacuum could be modified to use one or
> > the other depending on how many changes there are (of course, the user
> > could request one or the other, too; not sure what should be the default
> > behavior).
> 
> How is an incremental ANALYZE going to work at all?  It has no way to
> find out the recent changes in the table, for *either* inserts or
> deletes.  Unless you want to seqscan the whole table looking for tuples
> with xmin later than something-or-other ... which more or less defeats
> the purpose.

Yeah, I was thinking that this incremental ANALYZE would be the stream
in the "stream-based estimator" but evidently it doesn't work that way.
The stream that needs to be passed to the estimator consists of new
tuples as they are being inserted into the table, so this would need to
be done by the inserter process ... or it'd need to transmit the CTIDs
for someone else to stream them ... not an easy thing, in itself.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: estimating # of distinct values

From
Jim Nasby
Date:
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of Thu Dec 30 23:02:04 -0300 2010:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> I was thinking that we could have two different ANALYZE modes, one
>>> "full" and one "incremental"; autovacuum could be modified to use one or
>>> the other depending on how many changes there are (of course, the user
>>> could request one or the other, too; not sure what should be the default
>>> behavior).
>>
>> How is an incremental ANALYZE going to work at all?  It has no way to
>> find out the recent changes in the table, for *either* inserts or
>> deletes.  Unless you want to seqscan the whole table looking for tuples
>> with xmin later than something-or-other ... which more or less defeats
>> the purpose.
>
> Yeah, I was thinking that this incremental ANALYZE would be the stream
> in the "stream-based estimator" but evidently it doesn't work that way.
> The stream that needs to be passed to the estimator consists of new
> tuples as they are being inserted into the table, so this would need to
> be done by the inserter process ... or it'd need to transmit the CTIDs
> for someone else to stream them ... not an easy thing, in itself.

Perhaps listen/notify could be used for this, now that it allows passing a payload.

BTW, if we reduce the frequency at which full scans of large tables are
needed, then presumably the cost of the scans could be largely ignored.
If we don't need to scan frequently then we shouldn't care very much about
how long a scan takes, which means we could throttle it heavily. Presumably
even a heavily used system can spare 500kB/s of I/O to perform background
scanning...
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Csaba Nagy
Date:
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
> How is an incremental ANALYZE going to work at all?

How about a kind of continuous analyze ?

Instead of analyzing just once and then drop the intermediate results,
keep them on disk for all tables and then piggyback the background
writer (or have a dedicated process if that's not algorithmically
feasible) and before writing out stuff update the statistics based on
the values found in modified buffers. Probably it could take a random
sample of buffers to minimize overhead, but if it is done by a
background thread the overhead could be minimal anyway on multi-core
systems.

Not sure this makes sense at all, but if yes it would deliver the most
up to date statistics you can think of.

Cheers,
Csaba.




Re: estimating # of distinct values

From
tv@fuzzy.cz
Date:
> On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote:
>> How is an incremental ANALYZE going to work at all?
>
> How about a kind of continuous analyze ?
>
> Instead of analyzing just once and then drop the intermediate results,
> keep them on disk for all tables and then piggyback the background
> writer (or have a dedicated process if that's not algorithmically
> feasible) and before writing out stuff update the statistics based on
> the values found in modified buffers. Probably it could take a random
> sample of buffers to minimize overhead, but if it is done by a
> background thread the overhead could be minimal anyway on multi-core
> systems.

Hi,

the problem is you will eventually need to drop the results and rebuild
it, as the algorithms do not handle deletes (ok, Florian mentioned an
algorithm L_0 described in one of the papers, but I'm not sure we can use
it).

I'm not sure a constantly running background process is a good idea. I'd
prefer storing an info about the modified tuples somewhere, and starting
analyze only when a given threshold is reached. I'm not sure how to do
that, though.

Another thing I'm not sure about is where to store those intermediate
stats (used to get the current estimate, updated incrementally). I was
thinking about pg_stats but I'm not sure it's the right place - depending
on the algorithm, this may be a few kilobytes up to several megabytes. And
it's not needed except when updating it. Any ideas?

regards
Tomas



Re: estimating # of distinct values

From
Jim Nasby
Date:
On Jan 7, 2011, at 5:32 AM, tv@fuzzy.cz wrote:
> Another thing I'm not sure about is where to store those intermediate
> stats (used to get the current estimate, updated incrementally). I was
> thinking about pg_stats but I'm not sure it's the right place - depending
> on the algorithm, this may be a few kilobytes up to several megabytes. And
> it's not needed except when updating it. Any ideas?

If you're using it essentially as a queue, maybe a resource fork makes sense?
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Tomas Vondra
Date:
On 7.1.2011 20:56, Jim Nasby wrote:
> On Jan 7, 2011, at 5:32 AM, tv@fuzzy.cz wrote:
>> Another thing I'm not sure about is where to store those intermediate
>> stats (used to get the current estimate, updated incrementally). I was
>> thinking about pg_stats but I'm not sure it's the right place - depending
>> on the algorithm, this may be a few kilobytes up to several megabytes. And
>> it's not needed except when updating it. Any ideas?
> 
> If you're using it essentially as a queue, maybe a resource fork makes sense?

A resource fork? Not sure what you mean, could you describe it in more
detail?

regards
Tomas


Re: estimating # of distinct values

From
Jim Nasby
Date:
> A resource fork? Not sure what you mean, could you describe it in more
> detail?

Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README:

Relation Forks
==============

Since 8.4, a single smgr relation can be comprised of multiple physical
files, called relation forks. This allows storing additional metadata like
Free Space information in additional forks, which can be grown and truncated
independently of the main data file, while still treating it all as a single
physical relation in system catalogs.

It is assumed that the main fork, fork number 0 or MAIN_FORKNUM, always
exists. Fork numbers are assigned in src/include/storage/relfilenode.h.
Functions in smgr.c and md.c take an extra fork number argument, in addition
to relfilenode and block number, to identify which relation fork you want to
access. Since most code wants to access the main fork, a shortcut version of
ReadBuffer that accesses MAIN_FORKNUM is provided in the buffer manager for
convenience.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
tv@fuzzy.cz
Date:
> On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
>> the problem is you will eventually need to drop the results and rebuild
>> it, as the algorithms do not handle deletes (ok, Florian mentioned an
>> algorithm L_0 described in one of the papers, but I'm not sure we can
>> use
>> it).
>
> Yes, but even then you can start with much better cards if you already
> have an estimate of what it looks like, based on the fact that you did
> continuous updating of it. For example you'll have a pretty good
> estimate of the bounds of the number of distinct values, while if you
> really start from scratch you have nothing to start with, and must
> assume anything in the whole range between 'all values are distinct'
> and 'there are only a few of them'.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so you have to drop
the stats and rebuild them from scratch.

The biggest problem is not choosing reasonable parameters (some of the
settings can handle a few billion distinct values with something like
128kB of memory and less than 5% error). The really serious concern is the I/O
generated by rebuilding the stats.

>> Another thing I'm not sure about is where to store those intermediate
>> stats (used to get the current estimate, updated incrementally).
>
> The forks implementation proposed in other responses is probably the
> best idea, if usable. It will also solve the problem of memory
> limitations, at the expense of more resources used if the structure
> grows too big, but it will probably still be fast enough if it stays
> small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better
than storing that directly in memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate, it's
needed just when updating the stats.

So the last thing I'm not sure about is how to store the changed rows, so that
the update process can get a list of new values. Someone already offered
LISTEN/NOTIFY, but I guess we could just write the ctids into a file
(maybe another fork)?

regards
Tomas



Re: estimating # of distinct values

From
Csaba Nagy
Date:
On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
> the problem is you will eventually need to drop the results and rebuild
> it, as the algorithms do not handle deletes (ok, Florian mentioned an
> algorithm L_0 described in one of the papers, but I'm not sure we can use
> it).

Yes, but even then you can start with much better cards if you already
have an estimate of what it looks like, based on the fact that you did
continuous updating of it. For example you'll have a pretty good
estimate of the bounds of the number of distinct values, while if you
really start from scratch you have nothing to start with, and must
assume anything in the whole range between 'all values are distinct'
and 'there are only a few of them'.

> I'm not sure a constantly running background process is a good idea. I'd
> prefer storing an info about the modified tuples somewhere, and starting
> analyze only when a given threshold is reached. I'm not sure how to do
> that, though.
> 
> Another thing I'm not sure about is where to store those intermediate
> stats (used to get the current estimate, updated incrementally).

The forks implementation proposed in other responses is probably the
best idea, if usable. It will also solve the problem of memory
limitations, at the expense of more resources used if the structure
grows too big, but it will probably still be fast enough if it stays
small and in cache.

Cheers,
Csaba.




Re: estimating # of distinct values

From
Tomas Vondra
Date:
Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I've mentioned in my
previous posts. If you want to try it, the sources are available on github,
at http://tvondra.github.com/pg_estimator (ignore the page, I have to
update it, just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with two
simple benchmarks

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it with different data, and if you notice
something strange just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

  - quite good performance, especially with respect to the memory
    requirements (95 and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

  - very different algorithms, almost the same performance (especially
    with a large number of distinct values)

  - my personal favourites, as the memory requirements are still very
    reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

  - this algorithm is described as an "optimal algorithm" (not exactly,
    it's a building block of the optimal algorithm), but the results are
    not as good as with (1) and (2)

  - one of the problems is that it needs k-wise independent hash
    functions, and the MD5 hashing scheme used with the other algorithms
    probably does not conform to this condition :-(

  - I've tried to implement such hash functions on my own (using prime
    fields and polynomials in modular arithmetic), but the results were
    quite bad - if you know how to get such hash functions, let me know
    (there is a rough sketch of the construction below, after this list)

  - this algorithm is much more complex than the other algorithms, so if
    someone could do a review, that would be nice (maybe there is a bug
    resulting in the big estimate error)

4) Bloom Counter

  - basically a Bloom filter + a counter (incremented when a new element
    is added to the filter)

  - due to the design (no false negatives, possible false positives)
    it's significantly biased (gives underestimates), but I think that
    can be easily fixed with a coefficient (depending on the number of
    items / false positive error rate)

  - needs significantly more memory to get performance similar to the
    other algorithms (a Bloom counter that can track 1 million distinct
    values needs about 1.2MB of memory, while an S-Bitmap that tracks
    10 million items needs just 65kB)

  - so unless we can use the special feature of the Bloom filter (i.e.
    the ability to detect items that are in the data), this is a waste
    of memory

  I know Google uses Bloom filters in BigTable to limit I/O, i.e. before
  doing the I/O they ask the Bloom filter whether there are such data -
  if the answer is no, they don't have to do the I/O (false negatives
  are not allowed), if the answer is yes they do the I/O.

  Not sure if we can / want to do something like this in PostgreSQL, as
  it significantly depends on the workload (and in most cases we know
  the data are there, so we have to do the I/O anyway). But I guess it
  might be a good idea for a contrib module (maintain the Bloom filter
  on your own and make the decision in the application). The tricky part
  here is to keep the Bloom filter up to date.

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation fork' approach proposed as a
way to store the estimator.

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
On 9.1.2011 13:58, Jim Nasby wrote:
>> A resource fork? Not sure what you mean, could you describe it in more
>> detail?
> 
> Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done
some basic research and I think these are the basic steps when adding a
new fork:

1) define a new item in the ForkNumber enum (relfilenode.h) - this
   should be something like DISTINCT_FORKNUM, I guess

2) modify the ForkNames array (catalog.c) and relpathbackend() so that
   the proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right
or is there something else I need to do?
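
To make steps 1) and 2) concrete, roughly what I have in mind - based on
how the existing forks seem to be declared around 9.0/9.1, and with
DISTINCT_FORKNUM and the "dst" file suffix being made-up names:

/* src/include/storage/relfilenode.h */
typedef enum ForkNumber
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    DISTINCT_FORKNUM            /* new: per-table distinct estimators */
} ForkNumber;

#define MAX_FORKNUM     DISTINCT_FORKNUM

/* src/backend/catalog/catalog.c */
const char *forkNames[] = {
    "main",                     /* MAIN_FORKNUM */
    "fsm",                      /* FSM_FORKNUM */
    "vm",                       /* VISIBILITYMAP_FORKNUM */
    "dst"                       /* DISTINCT_FORKNUM (made-up suffix) */
};

As far as I can tell relpathbackend() just picks the name from
forkNames[], so apart from bumping MAX_FORKNUM there may not be much
more to step 2) - corrections welcome.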

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per
   column' (or 'per group of columns') so I'm not sure whether the file
   should contain all the estimators for the table, or if there should
   be one fork for each estimator. The former is a bit difficult to
   manage, the latter somehow breaks the current fork naming convention.

2) Where to keep the info that there is an estimator for a column? I
   guess we could put this into pg_attribute (it's one boolean). But
   what about the estimators for groups of columns? Because that's why
   I'm building this - to get distinct estimates for groups of columns.

   I guess we'll need a new system catalog to track this? (The same
   will be true for multi-dimensional histograms anyway.)

3) I still am not sure how to manage the updates, i.e. how to track the
   new values.

   One possibility might be to do that synchronously - whenever a new
   item is inserted into the table, check if there's an estimator and
   update it. Updating the estimators is quite efficient (e.g. the
   bitmap manages to do 10,000,000 inserts in 9 seconds on my ancient
   workstation), although there might be issues with locking etc.

   The other possibility is to update the estimator asynchronously, i.e.
   store the new values somewhere (or just the ctid of the row) and
   process them periodically. I'm not sure how to intercept the new rows
   and where to store them. In another fork? Somewhere else?
 

regards
Tomas


Re: estimating # of distinct values

From
Jim Nasby
Date:
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
> 1) Forks are 'per relation' but the distinct estimators are 'per
>   column' (or 'per group of columns') so I'm not sure whether the file
>   should contain all the estimators for the table, or if there should
>   be one fork for each estimator. The former is a bit difficult to
>   manage, the latter somehow breaks the current fork naming convention.

Yeah, when I looked at the fork stuff I was disappointed to find out
there's essentially no support for dynamically adding forks. There's two
other possible uses for that I can think of:

- Forks are very possibly a more efficient way to deal with TOAST than
  having separate tables. There's a fair amount of overhead we pay for
  the current setup.

- Dynamic forks would make it possible to do a column-store database, or
  at least something approximating one.

Without some research, there's no way to know if either of the above
makes sense; but without dynamic forks we're pretty much dead in the
water.

So I wonder what it would take to support dynamically adding forks...
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Robert Haas
Date:
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
> - Forks are very possibly a more efficient way to deal with TOAST than
> having separate tables. There's a fair amount of overhead we pay for
> the current setup.
 

That seems like an interesting idea, but I actually don't see why it
would be any more efficient, and it seems like you'd end up
reinventing things like vacuum and free space map management.

> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one.

I've been wondering whether we could do something like this by
treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
tv@fuzzy.cz
Date:
> On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote:
>> 1) Forks are 'per relation' but the distinct estimators are 'per
>>   column' (or 'per group of columns') so I'm not sure whether the file
>>   should contain all the estimators for the table, or if there should
>>   be one fork for each estimator. The former is a bit difficult to
>>   manage, the latter somehow breaks the current fork naming convention.
>
> Yeah, when I looked at the fork stuff I was disappointed to find out
> there's essentially no support for dynamically adding forks. There's two
> other possible uses for that I can think of:
>
> - Forks are very possibly a more efficient way to deal with TOAST than
> having separate tables. There's a fair amount of overhead we pay for the
> current setup.
> - Dynamic forks would make it possible to do a column-store database, or
> at least something approximating one.
>
> Without some research, there's no way to know if either of the above makes
> sense; but without dynamic forks we're pretty much dead in the water.
>
> So I wonder what it would take to support dynamically adding forks...

Interesting ideas, but a bit out of scope. I think I'll go with one fork
containing all the estimators for now, although it might be inconvenient
in some cases. I was thinking about rebuilding a single estimator with
increased precision - in that case the size changes so that all the other
data has to be shifted. But this won't be very common (usually all the
estimators will be rebuilt at the same time), and it's actually doable.

So the most important question is how to intercept the new/updated rows,
and where to store them. I think each backend should maintain its own
private list of new records and forward them only in case of commit. Does
that sound reasonable?

regards
Tomas



Re: estimating # of distinct values

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 4:53 AM,  <tv@fuzzy.cz> wrote:
> So the most important question is how to intercept the new/updated rows,
> and where to store them. I think each backend should maintain it's own
> private list of new records and forward them only in case of commit. Does
> that sound reasonable?

At the risk of sounding demoralizing, nothing about this proposal
sounds very promising to me, and that sounds like a particularly bad
idea.  What happens if the transaction updates a billion records?  Or
even a million records?  Are you going to store all of those in
backend-local memory until commit time?  Or spool them in a
potentially-gigantic disk file somewhere?  That memory allocation - or
file - could grow to be larger than the size of the entire database in
the worst case.  And COMMIT could take an awfully long time if it has
to spool megabytes or gigabytes of data off to some other process.
And what happens if there's a crash after the COMMIT but before all
that data is sent?  The estimates become permanently wrong?

And are we doing all of this just to get a more accurate estimate of
ndistinct?  For the amount of effort that it will probably take to get
this working at all, you could probably implement index-only scans and
have enough bandwidth left over to tackle global temporary tables.
And unless I'm missing something, the chances that the performance
consequences will be tolerable are pretty much zero.  And it would
only benefit the tiny fraction of users for whom bad n_distinct
estimates cause bad plans, and then the even tinier percentage of
those who can't conveniently fix it by using the manual override that
we already have in place - which presumably means people who have
gigantic tables that are regularly rewritten with massive changes in
the data distribution that affect plan choice.  Is that more than the
empty set?

Maybe the idea here is that this wouldn't fix just ndistinct estimates
but would also help with multi-column statistics.  Even if that's the
case, I think it's almost certainly a dead end from a performance
standpoint.  Some kind of manual ANALYZE process that can be invoked
when needed to scan the entire table would be painful but maybe
acceptable for certain people with a problem they can't fix any other
way, but doing this automatically for everyone seems like a really bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Jim Nasby
Date:
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
>> - Forks are very possibly a more efficient way to deal with TOAST than
>> having separate tables. There's a fair amount of overhead we pay for
>> the current setup.
>
> That seems like an interesting idea, but I actually don't see why it
> would be any more efficient, and it seems like you'd end up
> reinventing things like vacuum and free space map management.

The FSM would take some effort, but I don't think vacuum would be that
hard to deal with; you'd just have to free up the space in any referenced
toast forks at the same time that you vacuumed the heap.

>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one.
>
> I've been wondering whether we could do something like this by
> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote:
> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
>>> - Forks are very possibly a more efficient way to deal with TOAST
>>> than having separate tables. There's a fair amount of overhead we pay
>>> for the current setup.
>>
>> That seems like an interesting idea, but I actually don't see why it
>> would be any more efficient, and it seems like you'd end up
>> reinventing things like vacuum and free space map management.
>
> The FSM would take some effort, but I don't think vacuum would be that
> hard to deal with; you'd just have to free up the space in any
> referenced toast forks at the same time that you vacuumed the heap.

How's that different from what vacuum does on a TOAST table now?

>>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one.
>>
>> I've been wondering whether we could do something like this by
>> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
>> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
>> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
>> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
>
> Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.

What exactly do you mean by "tuple overhead"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote:
>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
>>>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair
>>>> amount of overhead we pay for the current setup.
>>>
>>> That seems like an interesting idea, but I actually don't see why it
>>> would be any more efficient, and it seems like you'd end up
>>> reinventing things like vacuum and free space map management.
>>
>> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up
>> the space in any referenced toast forks at the same time that you vacuumed the heap.
>
> How's that different from what vacuum does on a TOAST table now?

Even more to the point: Jim hasn't provided one single reason to suppose
that this would be better-performing than the existing approach.  It
looks to me like a large amount of work, and loss of on-disk
compatibility, for nothing at all except the sake of change.
        regards, tom lane


Re: estimating # of distinct values

From
Jim Nasby
Date:
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote:
> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote:
>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
>>>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair
>>>> amount of overhead we pay for the current setup.
>>>
>>> That seems like an interesting idea, but I actually don't see why it
>>> would be any more efficient, and it seems like you'd end up
>>> reinventing things like vacuum and free space map management.
>>
>> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up
>> the space in any referenced toast forks at the same time that you vacuumed the heap.
>
> How's that different from what vacuum does on a TOAST table now?

TOAST vacuum is currently an entirely separate vacuum. It might run at the same time as the main table vacuum, but it
still has all the work that would be associated with vacuuming a table with the definition of a toast table. In fact, at
one point vacuuming toast took two passes: the first deleted the toast rows that were no longer needed, then you had to
go back and vacuum those deleted rows.

>>>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one.
>>>
>>> I've been wondering whether we could do something like this by
>>> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two
>>> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with
>>> columns pk, b1, b2, b3.  SELECT * FROM t would be translated into
>>> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.
>>
>> Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.
>
> What exactly do you mean by "tuple overhead"?

typedef struct HeapTupleHeaderData. With only two tables it might not be that bad, depending on the fields. Beyond two
tables it's almost certainly a loser.
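
(For a rough sense of scale - this is just my back-of-the-envelope sketch,
not part of any proposal: each heap tuple costs roughly 23 bytes of
HeapTupleHeaderData, padded to ~24, plus a 4-byte line pointer, and a
vertical split also repeats the join key in every part. The numbers are
approximate and version-dependent.)

/*
 * Back-of-the-envelope illustration of "paying tuple overhead twice":
 * roughly 24 bytes of (padded) HeapTupleHeaderData plus a 4-byte line
 * pointer per tuple, and the join key repeated in every part table.
 * All numbers are approximate and version-dependent.
 */
#include <stdio.h>

int
main(void)
{
    const int   per_tuple_overhead = 24 + 4;    /* header + line pointer */
    const int   pk_width = 8;                   /* e.g. a bigint join key */

    for (int parts = 1; parts <= 4; parts++)
        printf("split into %d table(s): ~%d bytes of overhead per logical row\n",
               parts,
               parts * per_tuple_overhead + (parts - 1) * pk_width);

    return 0;
}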
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Jim Nasby
Date:
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote:
>>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote:
>>>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote:
>>>>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair
>>>>> amount of overhead we pay for the current setup.
>>>>
>>>> That seems like an interesting idea, but I actually don't see why it
>>>> would be any more efficient, and it seems like you'd end up
>>>> reinventing things like vacuum and free space map management.
>>>
>>> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up
>>> the space in any referenced toast forks at the same time that you vacuumed the heap.
>>
>> How's that different from what vacuum does on a TOAST table now?
>
> Even more to the point: Jim hasn't provided one single reason to suppose
> that this would be better-performing than the existing approach.  It
> looks to me like a large amount of work, and loss of on-disk
> compatibility, for nothing at all except the sake of change.

Yes, I was only pointing out that there were possible uses for allowing a variable number of forks per relation if
Tomas felt the need to go that direction. Changing toast would certainly require some very convincing arguments as to
the benefits.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: estimating # of distinct values

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby <jim@nasby.net> wrote:
>> How's that different from what vacuum does on a TOAST table now?
>
> TOAST vacuum is currently an entirely separate vacuum. It might run at the same time as the main table vacuum, but it
> still has all the work that would be associated with vacuuming a table with the definition of a toast table. In fact, at
> one point vacuuming toast took two passes: the first deleted the toast rows that were no longer needed, then you had to
> go back and vacuum those deleted rows.

I don't know whether it still works that way or not, but if it does,
fixing it could perhaps be done with far less drastic surgery than
what you're proposing here.

>>> Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.
>>
>> What exactly do you mean by "tuple overhead"?
>
> typedef struct HeapTupleHeaderData. With only two tables it might not be that bad, depending on the fields. Beyond
> two tables it's almost certainly a loser.

A struct definition by itself doesn't cause overhead.  Are you talking
about storage on disk?  CPU usage for MVCC visibility testing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 18.1.2011 18:12, Robert Haas napsal(a):
> On Tue, Jan 18, 2011 at 4:53 AM,  <tv@fuzzy.cz> wrote:
>> So the most important question is how to intercept the new/updated rows,
>> and where to store them. I think each backend should maintain it's own
>> private list of new records and forward them only in case of commit. Does
>> that sound reasonable?
> 
> At the risk of sounding demoralizing, nothing about this proposal
> sounds very promising to me, and that sounds like a particularly bad
> idea.  What happens if the transaction updates a billion records?  Or
> even a million records?  Are you going to store all of those in
> backend-local memory until commit time?  Or spool them in a
> potentially-gigantic disk file somewhere?  That memory allocation - or
> file - could grow to be larger than the size of the entire database in
> the worst case.  And COMMIT could take an awfully long time if it has
> to spool megabytes or gigabytes of data off to some other process.
> And what happens if there's a crash after the COMMIT but before all
> that data is sent?  The estimates become permanently wrong?

Yes, I was aware of this problem (the amount of memory consumed with lots of
updates), and I kind of hoped someone would come up with a reasonable
solution.

I was thinking about it and I believe at least one of the algorithms
(the "adaptive sampling") might handle "merging", i.e. the backends would
not need to store the list of values, just keep a private copy of the
estimator and update it. At the end (after commit), the local copy
would be merged back into the actual estimator.

So the algorithm would be something like this:

1. create copy for all distinct estimators influenced by the INSERT
2. update the local copy
3. commit and merge the local copies back into the original estimator

The amount of copied data is very small, depending on the number of
distinct items the estimator can handle and on its precision (see
adaptive.c for details). Some examples:

2^48 items + 1% error => 86 kB
2^48 items + 5% error => 3.4 kB
2^64 items + 1% error => 115 kB
2^64 items + 5% error => 4.6 kB

so it's much better than storing all the data.
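
To make step 3 a bit more concrete, here is a minimal sketch of what such
a merge could look like, assuming a simple linear-counting bitmap synopsis.
This is not the adaptive.c estimator - the names, sizes and the estimate
formula below are made up purely for illustration:

/*
 * Minimal sketch of the "local copy + merge at commit" idea, using a
 * simple linear-counting bitmap synopsis.  NOT the adaptive.c code;
 * names and sizes are made up for illustration only.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SYNOPSIS_BITS   (64 * 1024)             /* 8 kB bitmap */
#define SYNOPSIS_BYTES  (SYNOPSIS_BITS / 8)

typedef struct DistinctSynopsis
{
    uint8_t     bits[SYNOPSIS_BYTES];
} DistinctSynopsis;

/* Record one value: hash it (elsewhere) and set the corresponding bit. */
static void
synopsis_add(DistinctSynopsis *s, uint64_t hash)
{
    uint64_t    pos = hash % SYNOPSIS_BITS;

    s->bits[pos / 8] |= (uint8_t) (1 << (pos % 8));
}

/*
 * Merge a backend-local copy into the shared synopsis at commit time.
 * For bitmap-style synopses this is a plain bitwise OR, so the merge
 * cost depends only on the synopsis size (a few kB), not on how many
 * rows the transaction inserted.
 */
static void
synopsis_merge(DistinctSynopsis *shared, const DistinctSynopsis *local)
{
    for (int i = 0; i < SYNOPSIS_BYTES; i++)
        shared->bits[i] |= local->bits[i];
}

/* Linear-counting estimate: ndistinct ~ -m * ln(unset_bits / m). */
static double
synopsis_estimate(const DistinctSynopsis *s)
{
    int         unset = 0;

    for (int i = 0; i < SYNOPSIS_BITS; i++)
        if ((s->bits[i / 8] & (1 << (i % 8))) == 0)
            unset++;

    if (unset == 0)
        unset = 1;              /* saturated bitmap => lower bound only */

    return -(double) SYNOPSIS_BITS * log((double) unset / SYNOPSIS_BITS);
}

int
main(void)
{
    DistinctSynopsis shared, local;

    memset(&shared, 0, sizeof(shared));
    memset(&local, 0, sizeof(local));

    /* pretend these are hashes of values inserted by one transaction */
    for (uint64_t h = 1; h <= 1000; h++)
        synopsis_add(&local, h * 2654435761u);

    synopsis_merge(&shared, &local);
    printf("estimated ndistinct: %.0f\n", synopsis_estimate(&shared));
    return 0;
}

Deletes obviously can't be "un-ORed" out of such a synopsis, which is the
known weakness of this whole family of estimators.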

But I really need to check that this is possible (right now I'm quite busy
organizing a local conference, so maybe next week).

Regarding the crash scenario - if the commit fails, just throw away the
local estimator copy, it's not needed. I'm not sure how to take care of
the case when commit succeeds and the write of the merged estimator
fails, but I think it might be possible to write the estimator to xlog
or something like that. So it would be replayed during recovery etc. Or
is it a stupid idea?

> And are we doing all of this just to get a more accurate estimate of
> ndistinct?  For the amount of effort that it will probably take to get
> this working at all, you could probably implement index-only scans and
> have enough bandwidth left over to tackle global temporary tables.
> And unless I'm missing something, the chances that the performance
> consequences will be tolerable are pretty much zero.  And it would
> only benefit the tiny fraction of users for whom bad n_distinct
> estimates cause bad plans, and then the even tinier percentage of
> those who can't conveniently fix it by using the manual override that
> we already have in place - which presumably means people who have
> gigantic tables that are regularly rewritten with massive changes in
> the data distribution that affect plan choice.  Is that more than the
> empty set?

First of all, I've never proposed to use this as the default behaviour.
Actually I've strongly argued against that idea on several occasions. So
the user would have to enable this explicitly for some columns - I really
am not such an idiot as to force everyone to pay for this.

So the goal is to help the "small fraction" of users and keep the
current solution for the rest of them.

And yes, the main reason why I am working on this is the multi-column
stats (in the case of discrete columns). You can fix the ndistinct estimate
for a single column by manually overriding it, but for the multi-column
stats you need an estimate of ndistinct for the combination of columns,
and that's a bit difficult to do manually.

The other reason is that I've studied statistics (and I enjoy it, which
seems weird to most people). And it's a good way to learn the internals.

> Maybe the idea here is that this wouldn't fix just ndistinct estimates
> but would also help with multi-column statistics.  Even if that's the
> case, I think it's almost certainly a dead end from a performance
> standpoint.  Some kind of manual ANALYZE process that can be invoked
> when needed to scan the entire table would be painful but maybe
> acceptable for certain people with a problem they can't fix any other
> way, but doing this automatically for everyone seems like a really bad
> idea.

Yes, as I've mentioned above, the multi-column stats are the reason why
I started working on this. And yes, it really should work like this:

1. user realizes there's something wrong as the plans are bad
2. after analysis, the user realizes the reason is an imprecise estimate due to dependence between columns
3. the user enables cross-column stats on the columns and checks if it actually solved the issue
4. if the cost outweighs the gains, the user drops the stats

Does that sound reasonable?

regards
Tomas


Re: estimating # of distinct values

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Yes, I was aware of this problem (amount of memory consumed with lots of
> updates), and I kind of hoped someone will come up with a reasonable
> solution.

As far as I can see, periodically sampling some or all of the table is
really the only thing that has a chance of working OK.  You could try
to track changes only when they're not too big, but I think that's
still going to have nasty performance consequences.

> I was thinking about it and I believe at least one of the algorithms
> (the "adaptive sampling") might handle "merging" i.e. the backends would
> not need to store the list of values, just a private copy of the
> estimator and update it. And at the end (after commit), the estimator
> would be merged back into the actual one.
>
> So the algorithm would be something like this:
>
> 1. create copy for all distinct estimators influenced by the INSERT
> 2. update the local copy
> 3. commit and merge the local copies back into the original estimator

Well, maybe.  But DELETEs seem like a problem.  And it's still adding
foreground work to every query, which I just have a hard time
believing is ever going to work out.

> Regarding the crash scenario - if the commit fails, just throw away the
> local estimator copy, it's not needed. I'm not sure how to take care of
> the case when commit succeeds and the write of the merged estimator
> fails, but I think it might be possible to write the estimator to xlog
> or something like that. So it would be replayed during recovery etc. Or
> is it a stupid idea?

It's not stupid, in the sense that that is what you'd need to do if
you want to avoid ever having to rescan the table.  But it is another
thing that I think is going to be way too expensive.

> Yes, as I've mentioned above, the multi-column stats are the reason why
> I started working on this. And yes, it really should work like this:
>
> 1. user realizes there's something wrong as the plans are bad
> 2. after analysis, the user realizes the reason is an imprecise
>   estimate due to dependence between columns
> 3. the user enables cross-column stats on the columns and checks
>   if it actually solved the issue
> 4. if the cost outweights the gains, the user drops the stats
>
> Does that sound reasonable?

Yes.  The only caveat I'm trying to insert is that it's hard for me to
see how the proposed approach could ever be cheap enough to be a win.
If we need some kind of more expensive kind of ANALYZE that scans the
whole table, and it's optional, sure, why not?  But that's a one-time
hit.  You run it, it sucks, it's over, and now your queries work.
Odds are good you may never need to touch it again.  Now, if we can
convince ourselves that multi-column stats are likely to require
constant updating, then maybe there would be more to recommend the
stream-based approach, although even then it seems dreadfully complex
and expensive to me.  But I bet these things are pretty static.  If
the city and state tend to imply the zip code when there are 10K rows
in the table, they probably still will when there are 100K rows in the
table.  If users with org_id = 1 have a higher karma score on average
than users with org_id != 1, that'll probably still be true when there
are more users in both classes.  Whether the database can understand
that without rescanning the table depends on the data representation,
of course, but I guess I'm just saying I'd think really, really hard
before abandoning the idea of periodic sampling.  You have to get an
awful lot of benefit out of those cross-column stats to make it worth
paying a run-time cost for them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... I guess I'm just saying I'd think really, really hard
> before abandoning the idea of periodic sampling.  You have to get an
> awful lot of benefit out of those cross-column stats to make it worth
> paying a run-time cost for them.

I've been trying to not discourage Tomas from blue-sky speculation,
but I have to say I agree with Robert that the chance of any usable
result from this approach is very very small.  Feel free to do some
experimentation to prove us wrong --- but don't put a lot of effort
into it before that.
        regards, tom lane


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 19.1.2011 20:25, Robert Haas napsal(a):
> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Yes, I was aware of this problem (amount of memory consumed with lots of
>> updates), and I kind of hoped someone will come up with a reasonable
>> solution.
> 
> As far as I can see, periodically sampling some or all of the table is
> really the only thing that has a chance of working OK.  You could try
> to track changes only when they're not too big, but I think that's
> still going to have nasty performance consequences.

Yes, sampling all the table is the only way to get reliable ndistinct
estimates. I'm not sure what you mean by 'tracking changes' - as I've
mentioned in the previous post, this might be solved by updating a local
copy, which requires constant space (a few kB, see the previous post).
Is that acceptable? I don't know; that's up to the user, whether he wants to
pay this price.

>> So the algorithm would be something like this:
>>
>> 1. create copy for all distinct estimators influenced by the INSERT
>> 2. update the local copy
>> 3. commit and merge the local copies back into the original estimator
> 
> Well, maybe.  But DELETEs seem like a problem.  And it's still adding
> foreground work to every query, which I just have a hard time
> believing is ever going to work out

Yes, deletes are difficult to handle. My idea was to compute something
like ((tuples changed + tuples deleted) / tuples total) and indicate
that a rebuild of the estimator is needed once this coefficient grows
too large - e.g. log a message or something like that.
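
Just to make that concrete, a sketch of such a check (the 20% threshold
and the counter names are entirely made up, nothing like this exists):

/*
 * Hypothetical check deciding when the stream-based ndistinct estimator
 * should be rebuilt by a full scan.  Threshold and names are made up.
 */
#include <stdbool.h>

static bool
ndistinct_estimator_needs_rebuild(double tuples_updated,
                                  double tuples_deleted,
                                  double tuples_total)
{
    double      churn;

    if (tuples_total <= 0)
        return false;

    churn = (tuples_updated + tuples_deleted) / tuples_total;

    return churn > 0.20;        /* e.g. log a message / set a flag here */
}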

>> Regarding the crash scenario - if the commit fails, just throw away the
>> local estimator copy, it's not needed. I'm not sure how to take care of
>> the case when commit succeeds and the write of the merged estimator
>> fails, but I think it might be possible to write the estimator to xlog
>> or something like that. So it would be replayed during recovery etc. Or
>> is it a stupid idea?
> 
> It's not stupid, in the sense that that is what you'd need to do if
> you want to avoid ever having to rescan the table.  But it is another
> thing that I think is going to be way too expensive.

Way too expensive? All you need to put into the logfile is a copy of the
estimator, which is a few kBs. How is that 'way too expensive'? Sure, it
might be expensive when the user makes a lot of small changes in separate
transactions, that's true, but I guess we could minimize the amount of
data written to the xlog by doing a diff of the estimators or something
like that.

>> Yes, as I've mentioned above, the multi-column stats are the reason why
>> I started working on this. And yes, it really should work like this:
>>
>> 1. user realizes there's something wrong as the plans are bad
>> 2. after analysis, the user realizes the reason is an imprecise
>>   estimate due to dependence between columns
>> 3. the user enables cross-column stats on the columns and checks
>>   if it actually solved the issue
>> 4. if the cost outweights the gains, the user drops the stats
>>
>> Does that sound reasonable?
> 
> Yes.  The only caveat I'm trying to insert is that it's hard for me to
> see how the proposed approach could ever be cheap enough to be a win.

IMHO the question is not 'How expensive is that?' but rather 'How
expensive is it compared to the gains?' If the user gets much better
estimates and then a much better plan, this may be a completely
acceptable price.

> If we need some kind of more expensive kind of ANALYZE that scans the
> whole table, and it's optional, sure, why not?  But that's a one-time
> hit.  You run it, it sucks, it's over, and now your queries work.
> Odds are good you may never need to touch it again.  Now, if we can
> convince ourselves that multi-column stats are likely to require
> constant updating, then maybe there would be more to recommend the
> stream-based approach, although even then it seems dreadfully complex
> and expensive to me.  But I bet these things are pretty static.  If

No, the multi-column statistics do not require constant updating. There
are cases where a sampling is perfectly fine, although you may need a
bit larger sample. Generally if you can use a multi-dimensional
histogram, you don't need to scan the whole table.

But the multi-dimensional histograms are not applicable to some cases,
especially the ZIP code failure case that was repeatedly discussed. So
in the case of discrete data we need a different approach - and the
solution I've proposed is based on using ndistinct estimates to get the
selectivity estimates (it's actually based on one of the papers listed on
the wiki).

> and expensive to me.  But I bet these things are pretty static.  If
> the city and state tend to imply the zip code when there are 10K rows
> in the table, they probably still will when there are 100K rows in the
> table.  If users with org_id = 1 have a higher karma score on average

OK, how will you measure the "implicativeness"?

We have discussed this in another thread and there are a lot of gotchas,
although it seems like a really simple problem. The solution based on
ndistinct estimates turned out to be a reasonable approach that's worth
a try.

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 19.1.2011 20:46, Tom Lane napsal(a):
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... I guess I'm just saying I'd think really, really hard
>> before abandoning the idea of periodic sampling.  You have to get an
>> awful lot of benefit out of those cross-column stats to make it worth
>> paying a run-time cost for them.
> 
> I've been trying to not discourage Tomas from blue-sky speculation,
> but I have to say I agree with Robert that the chance of any usable
> result from this approach is very very small.  Feel free to do some
> experimentation to prove us wrong --- but don't put a lot of effort
> into it before that.
> 
>             regards, tom lane

Discourage? Not really - I did not expect this to be a simple thing
to implement. And yes, it might turn out to be a dead end eventually.

OTOH the approach "it seems really expensive so let's not try to build
it" is a certain dead end, so I'm not going to surrender due to such
speculations (although those are valid concerns and I'll have to address
them in the future).

regards
Tomas


Re: estimating # of distinct values

From
Nathan Boley
Date:
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 19.1.2011 20:25, Robert Haas napsal(a):
>> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Yes, I was aware of this problem (amount of memory consumed with lots of
>>> updates), and I kind of hoped someone will come up with a reasonable
>>> solution.
>>
>> As far as I can see, periodically sampling some or all of the table is
>> really the only thing that has a chance of working OK.  You could try
>> to track changes only when they're not too big, but I think that's
>> still going to have nasty performance consequences.
>
> Yes, sampling all the table is the only way to get reliable ndistinct
> estimates.

IMHO continuing to focus on worst case results is a dead end. Yes, for
some distributions, ndistinct is very hard to estimate well. However,
let us not forget that the current ndistinct estimator is very bad but
the postgres planner is, on the whole, very good.  As far as I can see
this is due to two facts:

1) The distribution of values in a table is rarely pathological, and
usually follows one of several common patterns. ( IOW, we have good
heuristics )

2) The choice of plan is fairly robust to mis-estimation of ndistinct,
because there are only a few things the executor can choose. It
doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
the table, we still usually need to scan the entire table.

Again, in my *very* humble opinion, if the ultimate goal is to improve
the planner, we should try to cut down on the number of cases in which
a poor estimate of ndistinct gives a really bad plan instead of
chasing after marginal gains from a better ndistinct estimator. Maybe
( as I've advocated for before ) this means parameterizing the
distribution of values ( ie, find that the values are roughly uniform
), maybe this means estimating the error of our statistics and using
the most robust rather than the best plan, or maybe it means something
totally different. But: ( and the literature seems to support me )
pounding away at the ndistinct estimator seems like a dead end. If you
think about it, it's a bit ridiculous to look at the whole table
*just* to "estimate" ndistinct - if we go that far why don't we just
store the full distribution and be done with it?

Best,
Nathan


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 19.1.2011 23:44, Nathan Boley napsal(a):
> 1) The distribution of values in a table is rarely pathological, and
> usually follows one of several common patterns. ( IOW, we have good
> heuristics )

Not true. You're missing the goal of this effort - to get an ndistinct
estimate for a combination of multiple columns, which is usually
pathological in the case of dependent columns. And that is exactly the case
where the user will explicitly enable this feature to get better
estimates (and thus better plans).

> 2) The choice of plan is fairly robust to mis-estimation of ndistinct,
> because there are only a few things the executer can choose. It
> doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
> the table, we still usually need to scan the entire table.

Again, don't think about a single column (although even in that case
there are known fail cases). Think about multiple columns and the number
of distinct combinations. With the attribute value independence assumption,
this is usually significantly underestimated. And that's a problem, as it
often leads to an index scan instead of a sequential scan (or something
like that).

> Again, in my *very* humble opinion, If the ultimate goal is to improve
> the planner, we should try to cut down on the number of cases in which
> a poor estimate of ndistinct gives a really bad plan instead of
> chasing after marginal gains from a better ndistinct estimator. Maybe

What is a marginal gain? The ultimate goal is to build cross-column
stats, because in the case of dependent columns the current approach
usually results in poor plans.
How is fixing that a marginal gain?

Yes, it may turn out it's not worth it, but it's a bit early to say that
without an implementation and some hard data.

> ( as I've advocated for before ) this means parameterizing the
> distribution of values ( ie, find that the values are roughly uniform
> ), maybe this means estimating the error of our statistics and using
> the most robust rather than the best plan, or maybe it means something
> totally different. But: ( and the literature seems to support me )

Which literature? I've read an awful lot of papers on this topic lately,
and I don't remember anything like that. If there's an interesting
paper, I'd like to read it. Yes, all the papers state that estimating
ndistinct is a particularly tricky task, but that's somewhat expected?

I've even checked how other databases do this - e.g. Oracle does it very
similarly to what I proposed (the user has to enable the feature, it has
to be rebuilt from time to time, etc.). I'm not saying we should do
everything like Oracle, but they are not clueless idiots.

> pounding away at the ndistinct estimator seems like a dead end. If you
> think about it, it's a bit ridiculous to look at the whole table
> *just* to "estimate" ndistinct - if we go that far why dont we just
> store the full distribution and be done with it?

No, I'm not trying to do this just to improve the ndistinct estimate.
The goal is to improve the cardinality estimate in case of dependent
columns, and one of the papers depends on good ndistinct estimates.

And actually it does not depend on ndistinct for the columns only, it
depends on ndistinct estimates for the combination of columns. So
improving the ndistinct estimates for columns is just a necessary first
step (and only if it works reasonably well, we can do the next step).

regards
Tomas


Re: estimating # of distinct values

From
Florian Pflug
Date:
On Jan19, 2011, at 23:44 , Nathan Boley wrote:
> If you think about it, it's a bit ridiculous to look at the whole table
> *just* to "estimate" ndistinct - if we go that far why dont we just
> store the full distribution and be done with it?

The crucial point that you're missing here is that ndistinct provides an
estimate even if you *don't* have a specific value to search for at hand.
This is way more common than you may think; it happens, e.g., every time you
PREPARE a statement with parameters. Even knowing the full distribution
has no advantage in this case - the best you could do is to average the
individual probabilities which gives ... well, 1/ndistinct.

best regards,
Florian Pflug



Re: estimating # of distinct values

From
Nathan Boley
Date:
>> If you think about it, it's a bit ridiculous to look at the whole table
>> *just* to "estimate" ndistinct - if we go that far why dont we just
>> store the full distribution and be done with it?
>
> - the best you could do is to average the
> individual probabilities which gives ... well, 1/ndistinct.
>

That is certainly *not* the best you could do in every case. The mean
is only the best estimate under L2 (squared-error) loss, which is
definitely not the case here.

Consider a table with 10K values, 9,990 of which are 1 and the rest of
which are 2, 3, ..., 10, versus a table that has the same 10 distinct
values evenly distributed. For a simple equality query, in the first
case, a bitmap scan might be best. In the second case, a sequential
scan would always be best.

This is precisely the point I was trying to make in my email: the loss
function is very complicated. Improving the ndistinct estimator could
significantly improve the estimates of ndistinct ( in the quadratic
loss sense ) while only marginally improving the plans.

-Nathan


Re: estimating # of distinct values

From
Nathan Boley
Date:
>
> Not true. You're missing the goal of this effort - to get ndistinct
> estimate for combination of multiple columns. Which is usually
> pathological in case of dependent columns.

<snip>

> Again, don't think about a single column (although even in that case
> there are known fail cases). Think about multiple columns and the number
> of distinct combinations. With attribute value independence assumption,
> this is usually significantly underestimated. And that's a problem as it
> often leads to an index scan instead of sequential scan (or something
> like that).

My point was that, in the case of single columns, we've done pretty
well despite the poor ndistinct estimates. I suspect the same will be
true in the case of multiple columns; good heuristics will trump
minimax estimators.

>> ( as I've advocated for before ) this means parameterizing the
>> distribution of values ( ie, find that the values are roughly uniform
>> ), maybe this means estimating the error of our statistics and using
>> the most robust rather than the best plan, or maybe it means something
>> totally different. But: ( and the literature seems to support me )
>
> Which literature? I've read an awful lot of papers on this topic lately,
> and I don't remember anything like that. If there's an interesting
> paper, I'd like to read it. Yes, all the papers state estimating a
> ndistinct is a particularly tricky task, but that's somehow expected?

It is definitely expected - non-parametric minimax results are always
*very* weak. The papers that you've cited seem to confirm this;
bounding ndistinct estimation error is tricky in the general case (
even with very simple loss functions, which we do not have ). The same
is true for other non-parametric methods. Think about kernel density
estimation: how many data points do you need to estimate the density
of a normal distribution well? What about if you *know* that the data
is normal ( or even that it comes from a big family? ).

> No, I'm not trying to do this just to improve the ndistinct estimate.
> The goal is to improve the cardinality estimate in case of dependent
> columns, and one of the papers depends on good ndistinct estimates.
>
> And actually it does not depend on ndistinct for the columns only, it
> depends on ndistinct estimates for the combination of columns. So
> improving the ndistinct estimates for columns is just a necessary first
> step (and only if it works reasonably well, we can do the next step).

I think that any approach which depends on precise estimates of
ndistinct is not practical.

I am very happy that you've spent so much time on this, and I'm sorry
if my previous email came off as combative. My point was only that
simple heuristics have served us well in the past and, before we go to
the effort of new, complicated schemes, we should see how well similar
heuristics work in the multiple column case. I am worried that if the
initial plan is too complicated then nothing will happen and, even if
something does happen, it will be tough to get it committed ( check
the archives for cross column stat threads - there are a lot ).

Best,
Nathan Boley


Re: estimating # of distinct values

From
Florian Pflug
Date:
On Jan20, 2011, at 02:41 , Nathan Boley wrote:
>>> If you think about it, it's a bit ridiculous to look at the whole table
>>> *just* to "estimate" ndistinct - if we go that far why dont we just
>>> store the full distribution and be done with it?
>> 
>> - the best you could do is to average the
>> individual probabilities which gives ... well, 1/ndistinct.
> 
> That is certainly *not* the best you could do in every case. The mean
> is only the best estimate in L2, which is definitely not the case
> here.

No, not in every case. But on average it comes out best, no?

> Consider a table with 10K values, 9,990 of which are 1 and the rest of
> which are 2, 3, ..., 10, versus a table that has the same 10 distinct
> values evenly distributed. For a simple equality query, in the first
> case, a bitmap scan might be best. In the second case, a sequential
> scan would always be best.

True. But selectivity estimates alone don't show the difference there.

Also, in my personal experience this isn't really the area we've got
problems now. The problem cases for me always were queries with a rather
large number of joins (7 to 10 or so) plus rather complex search
conditions. The join order, not the access strategy, then becomes the
deciding factor in plan performance. And the join order *is* driven purely
off the selectivity estimates, unlike the access strategy which even today
takes other factors into account (like clusteredness, I believe)

> This is precisely the point I was trying to make in my email: the loss
> function is very complicated. Improving the ndistinct estimator could
> significantly improve the estimates of ndistinct ( in the quadratic
> loss sense ) while only marginally improving the plans.


The point of this exercise isn't really to improve the ndistinct estimates
for single columns. Rather, we'd like to get ndistinct estimates for
groups of columns because that allows us to use the uniform bayesian
approach to multi-column selectivity estimation that Tomas found literature
on. Which on the whole seems like it *does* have a chance of greatly
improving the plans in some cases. We could, of course, estimate multi-column
ndistinct the same way we estimate single-column ndistincts, but quite a few
people raised concerns that this wouldn't work due to the large error our
current ndistinct estimates have. 

best regards,
Florian Pflug



Re: estimating # of distinct values

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Regarding the crash scenario - if the commit fails, just throw away the
>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>> the case when commit succeeds and the write of the merged estimator
>>> fails, but I think it might be possible to write the estimator to xlog
>>> or something like that. So it would be replayed during recovery etc. Or
>>> is it a stupid idea?
>>
>> It's not stupid, in the sense that that is what you'd need to do if
>> you want to avoid ever having to rescan the table.  But it is another
>> thing that I think is going to be way too expensive.
>
> Way too expensive? All you need to put into the logfile is a copy of the
> estimator, which is a few kBs. How is that 'way too expensive'?

At this point, this is all a matter of religion, right?  Neither of us
has a working implementation we've benchmarked.  But yes, I believe
you're going to find that implementing some kind of streaming
estimator is going to impose a...  <pulls number out of rear end> 6%
performance penalty, even after you've optimized the living daylights
out of it.  Now you might say... big deal, it improves my problem
queries by 100x.  OK, but if you could get the same benefit by doing
an occasional full table scan during off hours, you could have the
same performance with a *0%* performance penalty.  Even better, the
code changes would be confined to ANALYZE rather than spread out all
over the system, which has positive implications for robustness and
likelihood of commit.

I'm not trying to argue you out of working on this.  It's obviously
your time to spend, and if works better than I think it will, great!
I'm merely offering you an opinion on what will probably happen if you
go this route - namely, it'll carry an unpalatable run-time penalty.
That opinion may be worth no more than what you paid for it, but there
you have it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug <fgp@phlo.org> wrote:
> Also, in my personal experience this isn't really the area we've got
> problems now. The problem cases for me always were queries with a rather
> large number of joins (7 to 10 or so) plus rather complex search
> conditions. The join order, not the access strategy, then becomes the
> deciding factor in plan performance.

This certainly matches my experience (except sometimes with more joins).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: estimating # of distinct values

From
Nathan Boley
Date:
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan20, 2011, at 02:41 , Nathan Boley wrote:
>>>> If you think about it, it's a bit ridiculous to look at the whole table
>>>> *just* to "estimate" ndistinct - if we go that far why dont we just
>>>> store the full distribution and be done with it?
>>>
>>> - the best you could do is to average the
>>> individual probabilities which gives ... well, 1/ndistinct.
>>
>> That is certainly *not* the best you could do in every case. The mean
>> is only the best estimate in L2, which is definitely not the case
>> here.
>
> No, not in every case. But on average it comes out best, no?

In the sense of minimizing the average plan cost over all values?
Definitely not. Consider the trivial case where the cost of a bitmap scan
is the cost of a sequential scan + epsilon.

>
>> Consider a table with 10K values, 9,990 of which are 1 and the rest of
>> which are 2, 3, ..., 10, versus a table that has the same 10 distinct
>> values evenly distributed. For a simple equality query, in the first
>> case, a bitmap scan might be best. In the second case, a sequential
>> scan would always be best.
>
> True. But selectivity estimates alone don't show the difference there.

Yet the full distribution would - supporting my argument that even in
cases where we don't know a specific value, the full distribution is
more informative.

>
> Also, in my personal experience this isn't really the area we've got
> problems now. The problem cases for me always were queries with a rather
> large number of joins (7 to 10 or so) plus rather complex search
> conditions. The join order, not the access strategy, then becomes the
> deciding factor in plan performance. And the join order *is* driven purely
> off the selectivity estimates, unlike the access strategy which even today
> takes other factors into account (like clusteredness, I believe)
>

I think I'm losing you. Why would ndistinct be completely sufficient for
estimating the optimal join order?

>> This is precisely the point I was trying to make in my email: the loss
>> function is very complicated. Improving the ndistinct estimator could
>> significantly improve the estimates of ndistinct ( in the quadratic
>> loss sense ) while only marginally improving the plans.
>
>
> The point of this exercise isn't really to improve the ndistinct estimates
> for single columns. Rather, we'd like to get ndistinct estimates for
> groups of columns because that allows us to use the uniform bayesian
> approach to multi-column selectivity estimation that Tomas found literature
> on. Which on the whole seems like it *does* have a chance of greatly
> improving the plans in some cases. We could, of course, estimate multi-column
> ndistinct the same way we estimate single-column ndistincts, but quite a few
> people raised concerns that this wouldn't work due to the large error our
> current ndistinct estimates have.

Sure. But estimating multi-column ndistinct is surely not the only way
to approach this problem.

The comment I made, which you objected to, was that it's silly to look
at the whole table and then throw away all information *except*
ndistinct. If we really think that looking at the whole table is the
best approach, then shouldn't we be able to get better statistics? Is
this really an open question?

-Nathan


Re: estimating # of distinct values

From
Heikki Linnakangas
Date:
On 20.01.2011 04:36, Robert Haas wrote:
> ... Even better, the
> code changes would be confined to ANALYZE rather than spread out all
> over the system, which has positive implications for robustness and
> likelihood of commit.

Keep in mind that the administrator can already override the ndistinct 
estimate with ALTER TABLE. If he needs to manually run a special ANALYZE 
command to make it scan the whole table, he might as well just use ALTER 
TABLE to tell the system what the real (or good enough) value is. A DBA 
should have a pretty good feeling of what the distribution of his data 
is like.

And how good does the estimate need to be? For a single column, it's
usually not that critical, because if the column has only a few distinct 
values then we'll already estimate that pretty well, and OTOH if 
ndistinct is large, it doesn't usually affect the plans much if it's 10% 
of the number of rows or 90%.

It seems that the suggested multi-column selectivity estimator would be 
more sensitive to ndistinct of the individual columns. Is that correct? 
How is it biased? If we routinely under-estimate ndistinct of individual 
columns, for example, does the bias accumulate or cancel itself in the 
multi-column estimate?

I'd like to see some testing of the suggested selectivity estimator with 
the ndistinct estimates we have. Who knows, maybe it works fine in practice.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: estimating # of distinct values

From
Csaba Nagy
Date:
Hi Tomas,

On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
> No, the multi-column statistics do not require constant updating. There
> are cases where a sampling is perfectly fine, although you may need a
> bit larger sample. Generally if you can use a multi-dimensional
> histogram, you don't need to scan the whole table.

In the cases where sampling is enough, you can do the same for the updates
too: sample the changes, i.e. only let every Nth change make it to the
estimator. If you can also dynamically tune N - grow it as the statistics
stabilize, and lower it if you detect high variance - even better.
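
A rough sketch of what that per-backend sampling could look like (all
the names and the tuning rule below are hypothetical, nothing like this
exists in the backend):

/*
 * Hypothetical sampling of the change stream: only every Nth change is
 * forwarded to the distinct estimator, and N is re-tuned based on how
 * much the estimate has been moving.  Names and rules are made up.
 */
#include <stdbool.h>
#include <stdint.h>

static uint64_t changes_seen = 0;
static uint64_t sample_every_n = 100;

/* Called for each inserted/updated tuple; true means "feed the estimator". */
static bool
sample_this_change(void)
{
    return (changes_seen++ % sample_every_n) == 0;
}

/* Sample more aggressively while the estimate still moves, less once stable. */
static void
tune_sample_rate(double relative_change_of_estimate)
{
    if (relative_change_of_estimate > 0.05 && sample_every_n > 1)
        sample_every_n /= 2;
    else if (relative_change_of_estimate < 0.01 && sample_every_n < 1000000)
        sample_every_n *= 2;
}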

If the analyze process could be decoupled from the backends, and maybe
just get the data passed over to be processed asynchronously, then that
could be a feasible way to have always up-to-date statistics when the
bottleneck is I/O and CPU power is in excess. If that then leads to
better plans, it could really be a win exceeding the overhead.

If this analyze process (or more of them) could also just get the data
from the modified buffers in a cyclic way, so that backends need nothing
extra to do, then I don't see any performance disadvantage other than
possible extra locking contention on the buffers and non-determinism of
the actual time when a change makes it to the statistics. Then you just
need to get more CPU power and higher memory bandwidth to pay for the
accurate statistics.

Cheers,
Csaba.




Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 20.1.2011 03:06, Nathan Boley napsal(a):
>> And actually it does not depend on ndistinct for the columns only, it
>> depends on ndistinct estimates for the combination of columns. So
>> improving the ndistinct estimates for columns is just a necessary first
>> step (and only if it works reasonably well, we can do the next step).
> 
> I think that any approach which depends on precise estimates of
> ndistinct is not practical.

I'm not aware of any other approach to the 'discrete fail case' (where
the multi-dimensional histograms are not applicable). If someone finds a
better solution, I'll be the first one to throw away this stuff.

> I am very happy that you've spent so much time on this, and I'm sorry
> if my previous email came off as combative. My point was only that
> simple heuristics have served us well in the past and, before we go to
> the effort of new, complicated schemes, we should see how well similar
> heuristics work in the multiple column case. I am worried that if the
> initial plan is too complicated then nothing will happen and, even if
> something does happen, it will be tough to get it committed ( check
> the archives for cross column stat threads - there are a lot ).

If I've learned one thing over the years in IT, it's not to take critique
personally. All the problems mentioned in this thread are valid
concerns, pointing out weak points of the approach. And I'm quite happy
to receive this feedback - that's why I started it.

On the other hand - Jara Cimrman (a famous Czech fictional character,
depicted as the best scientist/poet/teacher/traveller/... - see [1])
once said that you can't be really sure you don't get gold by blowing
cigarette smoke into a basin drain, until you actually try it. So I'm
blowing cigarette smoke into the drain ...

It may very well happen that this will be a dead end, but I'll do my best to
fix all the issues or to prove that the pros outweigh the cons. And
even if it is eventually rejected, I hope to get a -1 from TL to be
eligible for that t-shirt ...

[1] http://en.wikipedia.org/wiki/Jara_Cimrman

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 20.1.2011 03:36, Robert Haas napsal(a):
> On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>>> Regarding the crash scenario - if the commit fails, just throw away the
>>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>>> the case when commit succeeds and the write of the merged estimator
>>>> fails, but I think it might be possible to write the estimator to xlog
>>>> or something like that. So it would be replayed during recovery etc. Or
>>>> is it a stupid idea?
>>>
>>> It's not stupid, in the sense that that is what you'd need to do if
>>> you want to avoid ever having to rescan the table.  But it is another
>>> thing that I think is going to be way too expensive.
>>
>> Way too expensive? All you need to put into the logfile is a copy of the
>> estimator, which is a few kBs. How is that 'way too expensive'?
> 
> At this point, this is all a matter of religion, right?  Neither of us
> has a working implementation we've benchmarked.  But yes, I believe
> you're going to find that implementing some kind of streaming
> estimator is going to impose a...  <pulls number out of rear end> 6%
> performance penalty, even after you've optimized the living daylights
> out of it.  Now you might say... big deal, it improves my problem
> queries by 100x.  OK, but if you could get the same benefit by doing
> an occasional full table scan during off hours, you could have the
> same performance with a *0%* performance penalty.  Even better, the
> code changes would be confined to ANALYZE rather than spread out all
> over the system, which has positive implications for robustness and
> likelihood of commit.

Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind collecting the
new values (which might lead to problems with memory etc., as you've
pointed out) and updating a local copy of the estimator (which is a good
idea I think).

But this might be another option - let the user decide if he wants to
continuously update the estimates (and pay the price) or do that during
off hours (and pay almost nothing). That sounds like a very good solution to me.

> I'm not trying to argue you out of working on this.  It's obviously
> your time to spend, and if works better than I think it will, great!
> I'm merely offering you an opinion on what will probably happen if you
> go this route - namely, it'll carry an unpalatable run-time penalty.
> That opinion may be worth no more than what you paid for it, but there
> you have it.

Yes, and I appreciate all feedback. But I still believe this can be done
so that users that don't need the feature don't pay for it.

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 20.1.2011 09:10, Heikki Linnakangas napsal(a):
> It seems that the suggested multi-column selectivity estimator would be
> more sensitive to ndistinct of the individual columns. Is that correct?
> How is it biased? If we routinely under-estimate ndistinct of individual
> columns, for example, does the bias accumulate or cancel itself in the
> multi-column estimate?
> 
> I'd like to see some testing of the suggested selectivity estimator with
> the ndistinct estimates we have. Who knows, maybe it works fine in
> practice.

The estimator for two columns and query 'A=a AND B=b' is about
0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b))

so it's quite simple. It's not that sensitive to errors in the ndistinct
estimates for the individual columns, but the problem is in the multi-column
ndistinct estimates. It's very likely that with dependent columns (e.g.
ZIP codes / cities) the distribution is so pathological that
the sampling-based estimate will be very off.
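
Spelled out as a function (the names are mine, this is not code from any
actual patch), the estimator above is simply:

/*
 * The two-column estimator quoted above.  Purely illustrative; the
 * function and parameter names are not from any patch.
 */
static double
clause_selectivity_two_columns(double ndist_a,   /* dist(A) */
                               double ndist_b,   /* dist(B) */
                               double ndist_ab,  /* dist(A,B) */
                               double prob_a,    /* P(A = a) */
                               double prob_b)    /* P(B = b) */
{
    return 0.5 * (ndist_a / ndist_ab * prob_a +
                  ndist_b / ndist_ab * prob_b);
}

Note that dist(A,B) divides both terms, so an error in the multi-column
ndistinct estimate shifts the whole result - which is why that estimate,
rather than the per-column ones, is the critical part.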

I guess this was way too short an analysis, but if you can provide more
details of the tests you expect, I'll be happy to provide that.

regards
Tomas


Re: estimating # of distinct values

From
Tomas Vondra
Date:
Dne 20.1.2011 11:05, Csaba Nagy napsal(a):
> Hi Tomas,
> 
> On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
>> No, the multi-column statistics do not require constant updating. There
>> are cases where a sampling is perfectly fine, although you may need a
>> bit larger sample. Generally if you can use a multi-dimensional
>> histogram, you don't need to scan the whole table.
> 
> In the cases where sampling is enough, you can do that to the updates
> too: do a sampling on the changes, in that you only process every Nth
> change to make it to the estimator. If you can also dynamically tune the
> N to grow it as the statistics stabilize, and lower it if you detect
> high variance, even better.
> 
> If the analyze process could be decoupled from the backends, and maybe
> just get the data passed over to be processed asynchronously, then that
> could be a feasible way to have always up to date statistics when the
> bottleneck is IO and CPU power is in excess. If that then leads to
> better plans, it could really be a win exceeding the overhead.

OK, this sounds interesting. I'm not sure how to do that but it might be
a good solution. What about transactions? If the client inserts data
(and it will be sent asynchronously to update the estimator) and then
rolls back, is the estimator 'rolled back' or what happens?

This was exactly the reason why I initially wanted to collect all the
data at the backend (and send them to the estimator at commit time).
Which was then replaced by the idea to keep a local estimator copy and
merge it back to the original estimator at commit time.

> If this analyze process (or more of them) could also just get the data
> from the modified buffers in a cyclic way, so that backends need nothing
> extra to do, then I don't see any performance disadvantage other than
> possible extra locking contention on the buffers and non-determinism of
> the actual time when a change makes it to the statistics. Then you just
> need to get more CPU power and higher memory bandwidth to pay for the
> accurate statistics.

Well, the possible locking contention sounds like a quite significant
problem to me :-(

The lag between an update and a change to the stats is not that big a deal,
I guess - we have the same behaviour with the rest of the stats (updated
by autovacuum every once in a while).

Tomas