Re: Thoughts on statistics for continuously advancing columns - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Thoughts on statistics for continuously advancing columns
Date
Msg-id 20469.1262573769@sss.pgh.pa.us
In response to Re: Thoughts on statistics for continuously advancing columns  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Thoughts on statistics for continuously advancing columns  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
I wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column.  If we can get it, plug it into the histogram, replacing the
> high or low bin boundary.  Then estimate as we currently do.  This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin.  We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I've applied a patch to HEAD that does the above.  Can you test it to
see how well it fixes your problem?

Looking at the current uses of the histogram stats, there is another
place that could possibly benefit from this type of on-demand index
search: mergejoinscansel.  That code attempts to estimate how much of a
column actually needs to be scanned by a merge join, recognizing that a
merge will stop reading either input once the other is exhausted.
Having an accurate idea of the join keys' upper bounds is fairly
important here, since the supposed cost reduction from not scanning all
of a table disappears if there's even one large-valued key on the other
side of the join.  On the other hand, making use of index searches in
mergejoinscansel would put these index searches into the standard, very
heavily used join planning path, not into a corner case which is all
they are right now.  So I'm fairly nervous about the potential hit on
join planning time.  Is there anyone around who can do planner timing
measurements on complex join queries involving large tables?  If so,
please try CVS HEAD as-is and after enabling this bit in selfuncs.c:
   /*
    * XXX It's very tempting to try to use the actual column min and max,
    * if we can get them relatively cheaply with an index probe.  However,
    * since this function is called many times during join planning,
    * that could have unpleasant effects on planning speed.  Need more
    * investigation before enabling this.
    */
#ifdef NOT_USED
   if (get_actual_variable_range(root, vardata, sortop, min, max))
       return true;
#endif
        regards, tom lane

