Re: Thoughts on statistics for continuously advancing columns - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Thoughts on statistics for continuously advancing columns
Date
Msg-id 6650.1262202920@sss.pgh.pa.us
In response to Re: Thoughts on statistics for continuously advancing columns  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: Thoughts on statistics for continuously advancing columns  (Dimitri Fontaine <dfontaine@hi-media.com>)
Re: Thoughts on statistics for continuously advancing columns  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Thoughts on statistics for continuously advancing columns  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
>> This seems like a fundamentally broken approach, first because "time
>> between analyzes" is not even approximately a constant, and second
>> because it assumes that we have a distance metric for all datatypes.

> Maybe you could compute a correlation between the column values and the
> transaction numbers to recognize a continuously advancing column.  It
> wouldn't tell you much about how fast they are advancing, but at least
> the typical use cases of serial and current timestamp columns should
> clearly stick out.  And then instead of assuming that a value beyond the
> histogram bound doesn't exist, you assume for example the average
> frequency, which should be pretty good for the serial and timestamp
> cases.  (Next step: Fourier analysis ;-) )

Actually, the histogram hasn't got much of anything to do with estimates
of the number of occurrences of a single value.

Josh hasn't shown us his specific problem query, but I would bet that
it's roughly like WHERE update_time > now() - interval 'something',
that is, the estimate that's problematic is an inequality, not an
equality.  When the range being asked for is outside the histogram
bounds, it really is rather difficult to come up with a reasonable
estimate --- you'd need a specific idea of how far outside the upper
bound it is, how fast the upper bound has been advancing, and how long
it's been since the last analyze.  (I find the last bit particularly
nasty, because it will mean that plans change even when "nothing is
changing" in the database.)

[ thinks for awhile ... ]

Actually, in the problematic cases, it's interesting to consider the
following strategy: when scalarineqsel notices that it's being asked for
a range estimate that's outside the current histogram bounds, first try
to obtain the actual current max() or min() of the column value --- this
is something we can get fairly cheaply if there's a btree index on the
column.  If we can get it, plug it into the histogram, replacing the
high or low bin boundary.  Then estimate as we currently do.  This would
work reasonably well as long as re-analyzes happen at a time scale such
that the histogram doesn't move much overall, ie, the number of
insertions between analyzes isn't a lot compared to the number of rows
per bin.  We'd have some linear-in-the-bin-size estimation error because
the modified last or first bin actually contains more rows than other
bins, but it would certainly work a lot better than it does now.
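
In code terms, the idea is roughly the following (a throwaway sketch,
not the real scalarineqsel; the flat array of bin boundaries and the
function name are simplifications):

/*
 * Sketch: fraction of rows with value <= probe, given an ascending array
 * of histogram bin boundaries.  "actual_max" stands in for a cheap
 * endpoint probe on the column's btree index.
 */
static double
histogram_fraction_below(const double *bounds, int nbounds,  /* nbounds >= 2 */
                         double probe, double actual_max)
{
    /*
     * If the probe is past the stored histogram, stretch the top bin out
     * to the live max so it covers rows inserted since the last ANALYZE.
     */
    double upper = bounds[nbounds - 1];

    if (probe > upper && actual_max > upper)
        upper = actual_max;

    if (probe <= bounds[0])
        return 0.0;
    if (probe >= upper)
        return 1.0;

    /*
     * Each of the (nbounds - 1) bins is assumed to hold an equal share of
     * the rows; interpolate linearly within the bin containing the probe.
     * Boundaries are assumed distinct.
     */
    for (int i = 1; i < nbounds; i++)
    {
        double hi = (i == nbounds - 1) ? upper : bounds[i];

        if (probe <= hi)
        {
            double lo = bounds[i - 1];

            return ((i - 1) + (probe - lo) / (hi - lo)) / (nbounds - 1);
        }
    }
    return 1.0;                 /* not reached */
}

The "> constant" selectivity is then just one minus that, and the only
systematic error is the stretched top bin pretending to hold one bin's
worth of rows when it actually holds more, which is the
linear-in-the-bin-size error mentioned above.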
        regards, tom lane

