Thread: Thoughts on statistics for continuously advancing columns

Thoughts on statistics for continuously advancing columns

From
Josh Berkus
Date:
All,

One of our clients is having query plan issues with a table with a
continuously advancing timestamp column (i.e. one with default now()).
The newest rows, which are the ones most in demand, are always estimated to
be fewer than they really are, or even non-existent.  As a result, the user has
to analyze the table every hour ... and it's a very large table.

I've seen this in a lot of other databases, both with timestamp columns
and with SERIALs -- both of which are very common table structures.
From my reading of the planner code, the problem seems to be the
histogram bounds ... if a requested value is above the high bound, it's
assumed to be extremely uncommon or not to exist at all.  This leads to bad plans
if analyze hasn't been run very recently.

My thoughts on dealing with this intelligently without a major change to
statistics gathering went along these lines:

1. add columns to pg_statistic to hold estimates of upper and lower
bounds growth between analyzes.

2. every time analyze is run, populate these columns with 1/2 of the
proportion of values above or below the previously stored bounds,
averaged with the existing value for the new columns.

3. use this factor instead of the existing algorithm to calculate the
row estimate for out-of-bounds values.

This is obviously a very rough idea (a standalone sketch of steps 2 and 3
follows below), but I wanted to get feedback on the
general problem and my approach before going further with it.
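
For concreteness, here is a rough standalone sketch of the bookkeeping in
steps 2 and 3.  All of the names are invented for illustration; nothing
below corresponds to actual pg_statistic columns or planner code:

#include <stdio.h>

typedef struct
{
    double hist_upper;     /* upper histogram bound stored at the last ANALYZE */
    double upper_growth;   /* smoothed fraction of rows found above that bound */
} ColStats;

/* Step 2: at ANALYZE time, fold the observed fraction of sampled rows that
 * lie above the previously stored bound into the running estimate. */
static void
update_growth(ColStats *st, double frac_above_old_bound, double new_upper)
{
    double observed = frac_above_old_bound / 2.0;   /* "1/2 of the proportion" */

    st->upper_growth = (st->upper_growth + observed) / 2.0;  /* average with old value */
    st->hist_upper = new_upper;
}

/* Step 3: for "col > c" with c above the stored bound, return this factor
 * instead of the near-zero estimate produced today. */
static double
out_of_bounds_selectivity(const ColStats *st)
{
    return st->upper_growth;
}

int
main(void)
{
    ColStats st = {1000.0, 0.0};

    update_growth(&st, 0.10, 1100.0);   /* 10% of the sample was above the old bound */
    printf("assumed selectivity above the histogram: %.3f\n",
           out_of_bounds_selectivity(&st));
    return 0;
}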

Thanks!

--Josh Berkus


Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> My thoughts on dealing with this intelligently without a major change to
> statistics gathering went along these lines:

> 1. add columns to pg_statistic to hold estimates of upper and lower
> bounds growth between analyzes.

This seems like a fundamentally broken approach, first because "time
between analyzes" is not even approximately a constant, and second
because it assumes that we have a distance metric for all datatypes.
(Note that convert_to_scalar does not assume that it can measure
arbitrary distances, but only fractions *within* a histogram bucket;
and even that is pretty shaky.)

I don't have a better idea at the moment :-(
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> My thoughts on dealing with this intelligently without a major
>> change to statistics gathering went along these lines:
> 
>> 1. add columns to pg_statistic to hold estimates of upper and
>> lower bounds growth between analyzes.
> 
> This seems like a fundamentally broken approach
> I don't have a better idea at the moment :-(

It's been a while since I've been bitten by this issue -- the last
time was under Sybase.  The Sybase suggestion was to either add
"dummy rows" [YUCK!] to set the extreme bounds or to "lie to the
optimizer" by fudging the statistics after each generation.  Perhaps
we could do better by adding columns for high and low bounds to
pg_statistic.  These would not be set by ANALYZE, but
user-modifiable to cover exactly this problem?  NULL would mean
current behavior?

-Kevin


Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't have a better idea at the moment :-(
> It's been a while since I've been bitten by this issue -- the last
> time was under Sybase.  The Sybase suggestion was to either add
> "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the
> optimizer" by fudging the statistics after each generation.  Perhaps
> we could do better by adding columns for high and low bounds to
> pg_statistic.  These would not be set by ANALYZE, but
> user-modifiable to cover exactly this problem?  NULL would mean
> current behavior?

Well, the problem Josh has got is exactly that a constant high bound
doesn't work.

What I'm wondering about is why he finds that re-running ANALYZE
isn't an acceptable solution.  It's supposed to be a reasonably
cheap thing to do.

I think the cleanest solution to this would be to make ANALYZE
cheaper, perhaps by finding some way for it to work incrementally.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
"Joshua D. Drake"
Date:
On Wed, 30 Dec 2009 11:16:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't have a better idea at the moment :-(
>  
>> It's been a while since I've been bitten by this issue -- the last
>> time was under Sybase.  The Sybase suggestion was to either add
>> "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the
>> optimizer" by fudging the statistics after each generation.  Perhaps
>> we could do better by adding columns for high and low bounds to
>> pg_statistic.  These would not be set by ANALYZE, but
>> user-modifiable to cover exactly this problem?  NULL would mean
>> current behavior?
> 
> Well, the problem Josh has got is exactly that a constant high bound
> doesn't work.
> 
> What I'm wondering about is why he finds that re-running ANALYZE
> isn't an acceptable solution.  It's supposed to be a reasonably
> cheap thing to do.

What makes ANALYZE cheap is two things:

1. It uses read-only bandwidth (for the most part), which is the kind of
bandwidth we have the most of
2. It doesn't take a lock that bothers anything

On the other hand ANALYZE also:

1. Uses lots of memory
2. Uses lots of processor
3. Can take a long time

We normally don't notice because most data sets won't incur a penalty.  We
have a customer with a single table that is over 1TB in size... we notice.
Granted, that is the extreme, but it would only take a quarter of that size
(which is common) to start seeing issues.

> 
> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.
> 

That could be interesting.  What about a running statistics set that has
some kind of threshold?  What I mean is, we run our normal analyze, but we
can mark a table "HOT" (yeah, bad term).  If we mark the table HOT,
statistics are generated on the fly for the planner and updated every X
interval.  Perhaps then written out at a checkpoint?

This is just off the top of my head.

JD

>             regards, tom lane

-- 
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org  Consulting, Development, Support, Training  503-667-4564 -
http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
 


Re: Thoughts on statistics for continuously advancing columns

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the problem Josh has got is exactly that a constant high
> bound doesn't work.

I thought the problem was that the high bound in the statistics fell
too far below the actual high end in the data.  This tends (in my
experience) to be much more painful than an artificially extended
high end in the statistics.  (YMMV, of course.)

> What I'm wondering about is why he finds that re-running ANALYZE
> isn't an acceptable solution.  It's supposed to be a reasonably
> cheap thing to do.

Good point.  We haven't hit this problem in PostgreSQL precisely
because we can run ANALYZE often enough to prevent the skew from
becoming pathological.

> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.

Yeah, though as you say above, it'd be good to know why frequent
ANALYZE is a problem as it stands.

-Kevin


Re: Thoughts on statistics for continuously advancing columns

From
Greg Smith
Date:
Joshua D. Drake wrote:
> We normally don't notice because most sets won't incur a penalty. We got a customer who
> has a single table that is over 1TB in size... We notice. Granted that is the extreme
> but it would only take a quarter of that size (which is common) to start seeing issues.
>   

Right, and the only thing that makes this case less painful is that you 
don't really need the stats to be updated quite as often in situations 
with that much data.  If, say, your stats say there's 2B rows in the 
table but there's actually 2.5B, that's a big error, but unlikely to 
change the types of plans you get.  Once there's millions of distinct 
values it takes a big change for plans to shift, etc.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com



Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes:
> Right, and the only thing that makes this case less painful is that you 
> don't really need the stats to be updated quite as often in situations 
> with that much data.  If, say, your stats say there's 2B rows in the 
> table but there's actually 2.5B, that's a big error, but unlikely to 
> change the types of plans you get.  Once there's millions of distinct 
> values it takes a big change for plans to shift, etc.

Normally, yeah.  I think Josh's problem is that he's got
performance-critical queries that are touching the "moving edge" of the
data set, and so the part of the stats that are relevant to them is
changing fast, even though in an overall sense the table contents might
not be changing much.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:
> If, say, your stats say there's 2B rows in the table but there's
> actually 2.5B, that's a big error, but unlikely to change the
> types of plans you get.  Once there's millions of distinct values
> it takes a big change for plans to shift, etc.

Well, the exception to that is if the stats say that your highest
value is x, and there are actually 500 million rows with values
greater than x, you can get some very bad plans for queries
requiring a range of values above x.

-Kevin


Re: Thoughts on statistics for continuously advancing columns

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Greg Smith <greg@2ndquadrant.com> writes:
> > Right, and the only thing that makes this case less painful is that you 
> > don't really need the stats to be updated quite as often in situations 
> > with that much data.  If, say, your stats say there's 2B rows in the 
> > table but there's actually 2.5B, that's a big error, but unlikely to 
> > change the types of plans you get.  Once there's millions of distinct 
> > values it takes a big change for plans to shift, etc.
> 
> Normally, yeah.  I think Josh's problem is that he's got
> performance-critical queries that are touching the "moving edge" of the
> data set, and so the part of the stats that are relevant to them is
> changing fast, even though in an overall sense the table contents might
> not be changing much.

Maybe only tangentially related: if this was a setup partitioned by a
timestamp, it would be very useful to be able to analyze only the
current partition and have updated stats for the parent relation as
well.  However AFAICT with your proposed changes in this area this would
not work, right?  You'd need an analyze on the parent relation, which is
painful.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane escribió:
>> Normally, yeah.  I think Josh's problem is that he's got
>> performance-critical queries that are touching the "moving edge" of the
>> data set, and so the part of the stats that are relevant to them is
>> changing fast, even though in an overall sense the table contents might
>> not be changing much.

> Maybe only tangentially related: if this was a setup partitioned by a
> timestamp, it would be very useful to be able to analyze only the
> current partition and have updated stats for the parent relation as
> well.  However AFAICT with your proposed changes in this area this would
> not work, right?  You'd need an analyze on the parent relation, which is
> painful.

Yeah, I was just thinking about that myself.  The parent-level ANALYZE
would approximately double the work involved, assuming that your total
data set is large enough to max out the number of blocks sampled.
So it'd be painful but not catastrophic.  Maybe the way to think about
the "incremental update" problem is to find a way to let ANALYZE
calculate parent-relation stats from the stats of the individual
partitions.  Not that I know how to do that either, but at least it's
a clearly stated task.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Greg Stark
Date:
On Wed, Dec 30, 2009 at 4:31 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who
> has a single table that is over 1TB in size... We notice. Granted that is
> the extreme
> but it would only take a quarter of that size (which is common) to start
> seeing issues.

I'm a bit puzzled by people's repeated suggestion here that large
tables take a long time to analyze. The sample that ANALYZE takes to
generate statistics is not heavily influenced by the size of the
table. Your 1TB table should take basically the same amount of time as
a 1GB table or a 1MB table (if it wasn't already in cache).

Unless the reason why it's 1TB is that the columns are extremely wide
rather than that it has a lot of rows? Or unless you've raised the
statistics target in (a misguided*) belief that larger tables require
larger statistics targets to achieve the same level of accuracy. Or
unless when you say "ANALYZE" you're really running "VACUUM ANALYZE".

[*] except for ndistinct estimates :(

-- 
greg


Re: Thoughts on statistics for continuously advancing columns

From
"Joshua D. Drake"
Date:
On Wed, 30 Dec 2009 18:42:38 +0000, Greg Stark <gsstark@mit.edu> wrote:

> I'm a bit puzzled by people's repeated suggestion here that large
> tables take a long time to analyze. The sample analyze takes to
> generate statistics is not heavily influenced by the size of the
> table. Your 1TB table should take basically the same amount of time as
> a 1GB table or a 1MB table (if it wasn't already in cache).

No. 

postgres=# analyze verbose test_one_million;
INFO:  analyzing "public.test_one_million"
INFO:  "test_one_million": scanned 3000 of 4425 pages, containing 677950 live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 168.009 ms
postgres=# analyze verbose test_one_million;
INFO:  analyzing "public.test_one_million"
INFO:  "test_one_million": scanned 3000 of 4425 pages, containing 677950 live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 104.006 ms
postgres=# analyze verbose test_ten_million;
INFO:  analyzing "public.test_ten_million"
INFO:  "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 20145.148 ms
postgres=# analyze verbose test_ten_million;
INFO:  analyzing "public.test_ten_million"
INFO:  "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 18481.053 ms
postgres=# analyze verbose test_ten_million;
INFO:  analyzing "public.test_ten_million"
INFO:  "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 17653.006 ms

The test_one_million table is very quick whether it's in cache or not.  I
don't think the ten million can actually fit in cache (small box), but
either way, if you look at the on-disk number for the one million (168 ms)
versus the on-disk numbers for the ten million, they are vastly different.

postgres=# select pg_size_pretty(pg_total_relation_size('test_one_million'));
 pg_size_pretty
----------------
 35 MB
(1 row)

Time: 108.006 ms
postgres=# select pg_size_pretty(pg_total_relation_size('test_ten_million'));
 pg_size_pretty
----------------
 346 MB
(1 row)


> 
> Unless the reason why it's 1TB is that the columns are extremely wide
> rather than that it has a lot of rows?

I should have qualified, yes they are very wide.

JD

-- 
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org  Consulting, Development, Support, Training  503-667-4564 -
http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
 


Re: Thoughts on statistics for continuously advancing columns

From
Peter Eisentraut
Date:
On tis, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.

Maybe you could compute a correlation between the column values and the
transaction numbers to recognize a continuously advancing column.  It
wouldn't tell you much about how fast they are advancing, but at least
the typical use cases of serial and current timestamp columns should
clearly stick out.  And then instead of assuming that a value beyond the
histogram bound doesn't exist, you assume for example the average
frequency, which should be pretty good for the serial and timestamp
cases.  (Next step: Fourier analysis ;-) )
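
For what it's worth, the detection step could be as cheap as a plain sample
correlation.  A minimal standalone sketch, where the sample position stands
in for the transaction number (this is only an illustration, not ANALYZE
code):

#include <math.h>
#include <stdio.h>

/* Plain Pearson correlation of two samples. */
static double
correlation(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (int i = 0; i < n; i++)
    {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    return (n * sxy - sx * sy) /
           sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

int
main(void)
{
    double order[5] = {1, 2, 3, 4, 5};              /* stand-in for insertion order */
    double serial_col[5] = {101, 102, 103, 104, 105};
    double random_col[5] = {42, 7, 99, 13, 58};

    printf("serial-like column: r = %.2f\n", correlation(order, serial_col, 5));
    printf("random column:      r = %.2f\n", correlation(order, random_col, 5));
    return 0;
}

A correlation near 1.0 would flag the column as continuously advancing.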



Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On tis, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
>> This seems like a fundamentally broken approach, first because "time
>> between analyzes" is not even approximately a constant, and second
>> because it assumes that we have a distance metric for all datatypes.

> Maybe you could compute a correlation between the column values and the
> transaction numbers to recognize a continuously advancing column.  It
> wouldn't tell you much about how fast they are advancing, but at least
> the typical use cases of serial and current timestamp columns should
> clearly stick out.  And then instead of assuming that a value beyond the
> histogram bound doesn't exist, you assume for example the average
> frequency, which should be pretty good for the serial and timestamp
> cases.  (Next step: Fourier analysis ;-) )

Actually, the histogram hasn't got much of anything to do with estimates
of the number of occurrences of a single value.

Josh hasn't shown us his specific problem query, but I would bet that
it's roughly like WHERE update_time > now() - interval 'something',
that is, the estimate that's problematic is an inequality not an
equality.  When the range being asked for is outside the histogram
bounds, it really is rather difficult to come up with a reasonable
estimate --- you'd need a specific idea of how far outside the upper
bound it is, how fast the upper bound has been advancing, and how long
it's been since the last analyze.  (I find the last bit particularly
nasty, because it will mean that plans change even when "nothing is
changing" in the database.)

[ thinks for awhile ... ]

Actually, in the problematic cases, it's interesting to consider the
following strategy: when scalarineqsel notices that it's being asked for
a range estimate that's outside the current histogram bounds, first try
to obtain the actual current max() or min() of the column value --- this
is something we can get fairly cheaply if there's a btree index on the
column.  If we can get it, plug it into the histogram, replacing the
high or low bin boundary.  Then estimate as we currently do.  This would
work reasonably well as long as re-analyzes happen at a time scale such
that the histogram doesn't move much overall, ie, the number of
insertions between analyzes isn't a lot compared to the number of rows
per bin.  We'd have some linear-in-the-bin-size estimation error because
the modified last or first bin actually contains more rows than other
bins, but it would certainly work a lot better than it does now.
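
To make that concrete, here is a toy standalone illustration -- not the real
scalarineqsel, and every name and number below is invented -- of a
greater-than estimate from an equi-depth histogram, before and after
plugging an index-probed max() into the stale upper bound:

#include <stdio.h>

#define NBOUNDS 5                 /* 4 equal-population bins */

/* Selectivity of "col > c" from an equi-depth histogram. */
static double
gt_selectivity(const double bounds[NBOUNDS], double c)
{
    int nbins = NBOUNDS - 1;

    if (c <= bounds[0])
        return 1.0;
    if (c >= bounds[NBOUNDS - 1])
        return 0.0;               /* out of bounds: assumed (nearly) nonexistent */

    for (int i = 0; i < nbins; i++)
    {
        if (c < bounds[i + 1])
        {
            double frac_of_bin = (bounds[i + 1] - c) / (bounds[i + 1] - bounds[i]);

            return (frac_of_bin + (nbins - 1 - i)) / nbins;
        }
    }
    return 0.0;                   /* not reached */
}

int
main(void)
{
    /* histogram captured at the last ANALYZE: values then ran 0 .. 1000 */
    double bounds[NBOUNDS] = {0, 250, 500, 750, 1000};
    double actual_max = 1100;     /* the column has kept advancing since then */
    double c = 1050;              /* the query asks for col > 1050 */

    printf("estimate with stale bounds:     %.4f\n", gt_selectivity(bounds, c));

    /* the proposed tweak: plug the index-probed max() into the last bin */
    bounds[NBOUNDS - 1] = actual_max;
    printf("estimate with index-probed max: %.4f\n", gt_selectivity(bounds, c));
    return 0;
}

The first estimate comes out as zero (the value is assumed not to exist),
the second as a usable nonzero fraction, with only the linear-in-the-bin-size
error described above.
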
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Chris Browne
Date:
jd@commandprompt.com ("Joshua D. Drake") writes:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who
> has a single table that is over 1TB in size... We notice. Granted that is
> the extreme
> but it would only take a quarter of that size (which is common) to start
> seeing issues.

I find it curious that ANALYZE *would* take a long time to run.

After all, its sampling strategy means that, barring having SET
STATISTICS to some ghastly high number, it shouldn't need to do
materially more work to analyze a 1TB table than is required to analyze
a 1GB table.

With the out-of-the-box (which may have changed without my notice ;-))
default of 10 bars in the histogram, it should search for 30K rows,
which, while not "free," doesn't get enormously more expensive as tables
grow.
-- 
"cbbrowne","@","gmail.com"
http://linuxfinances.info/info/linuxdistributions.html
Rules  of  the  Evil  Overlord   #179.  "I  will  not  outsource  core
functions." <http://www.eviloverlord.com/>


Re: Thoughts on statistics for continuously advancing columns

From
Greg Stark
Date:
Well, that's interesting, because they claim to be doing exactly the same
amount of I/O in terms of pages.

In the first case it's reading 3/4 of the table, so it's effectively doing
a sequential scan.  In the second case it's only scanning 7.5%, so you
would expect it to be slower, but not that much slower.

If, as you say, the rows are very wide, then the other part of the equation
will be TOAST table I/O.  I'm not sure what that would look like, but I bet
analyze isn't optimized to handle it well -- not much of postgres really
knows about TOAST.  It'll be accessing the same number of TOAST records,
but out of a much bigger TOAST table.
--
greg

Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Chris Browne <cbbrowne@acm.org> writes:
> I find it curious that ANALYZE *would* take a long time to run.

> After all, its sampling strategy means that, barring having SET
> STATISTICS to some ghastly high number, it shouldn't need to do
> materially more work to analyze a 1TB table than is required to analyze
> a 1GB table.

Right.  The example JD quotes in this thread compares a 35MB table
to a 350MB one, and the difference is all about having crossed the
threshold of what would fit in his available RAM.  There isn't going
to be much difference in the ANALYZE time for "big" versus "very big"
tables.  (There might, however, be a difference in the quality of
the resulting stats :-()
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Greg Smith
Date:
Joshua D. Drake wrote:
> postgres=# analyze verbose test_ten_million;
> INFO:  analyzing "public.test_ten_million"
> INFO:  "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
> ANALYZE
> Time: 20145.148 ms
>   

At ever larger table sizes, this would turn into 3000 random seeks 
all over the drive, one at a time because there's no async I/O here to 
queue requests better than that for this access pattern.  Let's say they 
take 10ms each, not an unrealistic amount of time on current hardware.  
That's 30 seconds, best case, which is similar to what JD's example is 
showing even on a pretty small data set.  Under load it could easily 
take over a minute, hammering the disks the whole time, and in a TOAST 
situation you're doing even more work.  It's not outrageous and it 
doesn't scale linearly with table size, but it's not something you want 
to happen any more than you have to either--consider the poor client who 
is trying to get their work done while that is going on.
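
Spelled out as arithmetic (the 10 ms figure is just an assumption):

#include <stdio.h>

int
main(void)
{
    int    pages_sampled = 3000;  /* pages ANALYZE touches in JD's example */
    double seek_ms = 10.0;        /* assumed random-read latency per page */

    printf("best-case ANALYZE I/O time: %.0f seconds\n",
           pages_sampled * seek_ms / 1000.0);
    return 0;
}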

On smaller tables, you're both more likely to grab a useful next page 
via readahead, and to just have the data you need cached in RAM 
already.  There's a couple of "shelves" in the response time to finish 
ANALYZE as you exceed L1/L2 CPU cache size and RAM size, then it trails 
downward as the seeks get longer and longer once the data you need is 
spread further across the disk(s).  That the logical beginning of a 
drive is much faster than the logical end doesn't help either.  I should 
generate that graph again one day somewhere I can release it at...

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com



Re: Thoughts on statistics for continuously advancing columns

From
Craig Ringer
Date:
On 31/12/2009 12:33 AM, Kevin Grittner wrote:
> Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>
>> Well, the problem Josh has got is exactly that a constant high
>> bound doesn't work.
>
> I thought the problem was that the high bound in the statistics fell
> too far below the actual high end in the data.  This tends (in my
> experience) to be much more painful than an artificially extended
> high end in the statistics.  (YMMV, of course.)
>
>> What I'm wondering about is why he finds that re-running ANALYZE
>> isn't an acceptable solution.  It's supposed to be a reasonably
>> cheap thing to do.
>
> Good point.  We haven't hit this problem in PostgreSQL precisely
> because we can run ANALYZE often enough to prevent the skew from
> becoming pathological.

While regular ANALYZE seems to be pretty good ... is it insane to 
suggest determining the min/max bounds of problem columns by looking at 
a btree index on the column in ANALYZE, instead of relying on random 
data sampling? An ANALYZE that didn't even have to scan the indexes, but 
just looked at the ends, might be something that could be run much more 
frequently with less I/O and memory cost than a normal ANALYZE, just to 
selectively update key stats that are an issue for such continuously 
advancing columns.

--
Craig Ringer


Re: Thoughts on statistics for continuously advancing columns

From
Craig Ringer
Date:
> While regular ANALYZE seems to be pretty good ... is it insane to
> suggest determining the min/max bounds of problem columns by looking at
> a btree index on the column in ANALYZE, instead of relying on random
> data sampling? An ANALYZE that didn't even have to scan the indexes but
> just look at the ends might be something that could be run much more
> frequently with less I/O and memory cost than a normal ANALYZE, just to
> selectively update key stats that are an issue for such continuously
> advancing columns.

... which Tom Lane already raised in a smarter way in a message I hadn't 
read yet. Sorry for the noise.

--
Craig Ringer


Re: Thoughts on statistics for continuously advancing columns

From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column.  If we can get it, plug it into the histogram, replacing the
> high or low bin boundary.  Then estimate as we currently do.  This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin.  We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I know very little about statistics in general, but your proposal seems
straightforward enough for me to understand it, and looks good: +1.

Regards,
-- 
dim


Re: Thoughts on statistics for continuously advancing columns

From
Simon Riggs
Date:
On Wed, 2009-12-30 at 14:55 -0500, Tom Lane wrote:

> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column.  If we can get it, plug it into the histogram, replacing the
> high or low bin boundary.  Then estimate as we currently do.  This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin.  We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

Histograms often move quickly, but they seldom change shape.

Why not get both max() and min(), then rebase the histogram according to
those values?  That way the histogram can still move significantly and
the technique will still work.
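
One possible reading of "rebase", at least for types where we do have a
distance metric (such as integers and timestamps), would be to linearly
rescale the stored bounds onto a freshly probed min() and max().  A toy
sketch, with invented numbers:

#include <stdio.h>

/* Rescale histogram bounds so the stored min/max map onto newly probed
 * min()/max() values.  Only meaningful with a distance metric. */
static void
rebase_histogram(double *bounds, int n, double new_min, double new_max)
{
    double old_min = bounds[0];
    double old_max = bounds[n - 1];
    double scale = (new_max - new_min) / (old_max - old_min);

    for (int i = 0; i < n; i++)
        bounds[i] = new_min + (bounds[i] - old_min) * scale;
}

int
main(void)
{
    double bounds[5] = {0, 250, 500, 750, 1000};

    rebase_histogram(bounds, 5, 100, 1100);     /* index-probed min() and max() */
    for (int i = 0; i < 5; i++)
        printf("%.0f ", bounds[i]);
    printf("\n");
    return 0;
}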

-- Simon Riggs           www.2ndQuadrant.com



Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> Why not get both max() and min(), then rebase the histogram according to
> those values. That way the histogram can still move significantly and
> the technique will still work.

Define "rebase", keeping in mind that this has to work on datatypes that
we don't have a distance metric for.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Simon Riggs
Date:
On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Why not get both max() and min(), then rebase the histogram according to
> > those values. That way the histogram can still move significantly and
> > the technique will still work.
> 
> Define "rebase", keeping in mind that this has to work on datatypes that
> we don't have a distance metric for.

Make it work differently according to whether we have one or not, just as
we do elsewhere with stats. No point in limiting ourselves to the lowest
common denominator, especially when the common case is integer keys and
time datatypes.

-- Simon Riggs           www.2ndQuadrant.com



Re: Thoughts on statistics for continuously advancing columns

From
Simon Riggs
Date:
On Thu, 2009-12-31 at 21:29 +0000, Simon Riggs wrote:
> On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > Why not get both max() and min(), then rebase the histogram according to
> > > those values. That way the histogram can still move significantly and
> > > the technique will still work.
> > 
> > Define "rebase", keeping in mind that this has to work on datatypes that
> > we don't have a distance metric for.
> 
> Make it work differently according to whether we have, or not, just as
> we do elsewhere with stats. No point in limiting ourselves to the lowest
> common denominator, especially when the common case is integer keys and
> time datatypes.

This seemed obvious but I understand now that you meant we don't know
that from the datatype definition, so we can't do as I suggested, yet.

-- Simon Riggs           www.2ndQuadrant.com



Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
I wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column.  If we can get it, plug it into the histogram, replacing the
> high or low bin boundary.  Then estimate as we currently do.  This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin.  We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I've applied a patch to HEAD that does the above.  Can you test it to
see how well it fixes your problem?

Looking at the current uses of the histogram stats, there is another
place that could possibly benefit from this type of on-demand index
search: mergejoinscansel.  That code attempts to estimate how much of a
column actually needs to be scanned by a merge join, recognizing that a
merge will stop reading either input once the other is exhausted.
Having an accurate idea of the join keys' upper bounds is fairly
important here, since the supposed cost reduction from not scanning all
of a table disappears if there's even one large-valued key on the other
side of the join.  On the other hand, making use of index searches in
mergejoinscansel would put these index searches into the standard, very
heavily used join planning path, not into a corner case which is all
they are right now.  So I'm fairly nervous about the potential hit on
join planning time.  Is there anyone around who can do planner timing
measurements on complex join queries involving large tables?  If so,
please try CVS HEAD as-is and after enabling this bit in selfuncs.c:
    /*
     * XXX It's very tempting to try to use the actual column min and max,
     * if we can get them relatively-cheaply with an index probe.  However,
     * since this function is called many times during join planning,
     * that could have unpleasant effects on planning speed.  Need more
     * investigation before enabling this.
     */
#ifdef NOT_USED
    if (get_actual_variable_range(root, vardata, sortop, min, max))
        return true;
#endif
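
As a toy illustration of why the join keys' upper bound matters so much
here (uniform keys and invented numbers; this is not the mergejoinscansel
code):

#include <stdio.h>

/* Fraction of a sorted input a merge join must read before the other
 * input is exhausted, assuming uniformly distributed keys. */
static double
fraction_scanned(double my_min, double my_max, double other_max)
{
    if (other_max >= my_max)
        return 1.0;               /* one large key on the other side: full scan */
    return (other_max - my_min) / (my_max - my_min);
}

int
main(void)
{
    /* this side holds keys 0 .. 1,000,000 */
    printf("stale stats (other side max 500000):  scan %.0f%%\n",
           100.0 * fraction_scanned(0, 1000000, 500000));
    printf("actual data (other side max 1000000): scan %.0f%%\n",
           100.0 * fraction_scanned(0, 1000000, 1000000));
    return 0;
}
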
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Csaba Nagy
Date:
On Wed, 2009-12-30 at 17:16 +0100, Tom Lane wrote:
> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.

What if, when inserting or deleting a tuple, a random sample of the changes
were passed into an auto-analyze buffer?

Then a special process (the auto-analyze daemon) would process them and
update the statistics incrementally based on the new values found (which
might or might not be mathematically feasible).

The overhead for each backend process would be kept within limits by the
rate at which you randomly send (or don't send) each change to the analyze
buffer.

The processing overhead would be kept within limits by the processing rate
of the auto-analyze daemon, which could be made to sleep periodically, or
to span multiple processes (on multiprocessor systems).

If the buffer is full, you simply skip putting things into it... so it
would also autotune itself to a sustainable rate.
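
A hand-wavy standalone sketch of such a buffer (all sizes, rates and names
are made up, and a real version would of course live in shared memory with
proper locking):

#include <stdio.h>
#include <stdlib.h>

#define BUFSLOTS     1024
#define SAMPLE_1_IN  100          /* per-row sampling rate paid by the backends */

typedef struct
{
    double values[BUFSLOTS];
    int    used;
} AnalyzeBuffer;

static AnalyzeBuffer buf;

/* Called from the insert/delete path: cheap, and self-limiting when full. */
static void
maybe_sample(double newval)
{
    if (rand() % SAMPLE_1_IN != 0)
        return;                   /* randomly skip most changes */
    if (buf.used >= BUFSLOTS)
        return;                   /* buffer full: just drop the sample */
    buf.values[buf.used++] = newval;
}

/* Called from the auto-analyze daemon: consume whatever has accumulated. */
static int
drain(double *out, int maxvals)
{
    int n = buf.used < maxvals ? buf.used : maxvals;

    for (int i = 0; i < n; i++)
        out[i] = buf.values[i];
    buf.used = 0;
    return n;
}

int
main(void)
{
    double drained[BUFSLOTS];

    for (int i = 0; i < 100000; i++)
        maybe_sample((double) i);
    printf("samples collected for the daemon: %d\n", drain(drained, BUFSLOTS));
    return 0;
}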


Of course, as with all my other posts on -hackers, this is all mostly
hand-waving; I have no clue about the feasibility of any of this with
regard to the current state of the code (which I haven't read -- I
unfortunately find myself hating to read C code beyond reason, and my
writing of it has so far amounted to copy-paste-modify).

Cheers,
Csaba.



Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> I've applied a patch to HEAD that does the above.  Can you test it to
>> see how well it fixes your problem?

> Sure.  It'll take us a while to set up a test environment; the database
> is 1TB in size so converting it to 8.5 isn't quick.

Great.  When you have it set up, you might want to play with enabling
the mergejoinscansel change as well, and see if that is a net plus or
minus for you.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Josh Berkus
Date:
> I've applied a patch to HEAD that does the above.  Can you test it to
> see how well it fixes your problem?

Sure.  It'll take us a while to set up a test environment; the database
is 1TB in size so converting it to 8.5 isn't quick.

Will report back.

--Josh



Re: Thoughts on statistics for continuously advancing columns

From
Josh Berkus
Date:
> Great.  When you have it set up, you might want to play with enabling
> the mergejoinscansel change as well, and see if that is a net plus or
> minus for you.

How would I *disable* it?

--Josh


Re: Thoughts on statistics for continuously advancing columns

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Great.  When you have it set up, you might want to play with enabling
>> the mergejoinscansel change as well, and see if that is a net plus or
>> minus for you.

> How would I *disable* it?

It's #ifdef NOT_USED in CVS at the moment.
        regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Chetan Suttraway
Date:
Hi,

My suggestion is to keep two sets of histograms: one generated by running
ANALYZE, and another generated dynamically from the entries that are logged
by insert/update/delete operations.  I am not sure how difficult it is to
read such record details from the logs.

Basically, from the details mentioned here, what I understand is that the
table data (timestamps) is added incrementally: existing data is not
modified to any great extent, and new data is merely appended to old data.
In this case, the only work for analyze/statistics generation is to merge
the histograms of the newly added records into the old histograms.

If we can treat this case as similar to merging histograms for a join of
two tables (generating the histograms for the result node), then all we
need to do is somehow generate a temporary histogram for the new set of
records and merge it with the old histogram.  The interesting columns of
the new records could be read from the logging section.  We must already
have the histogram-merging part, and I hope it would not be very costly to
make these calls so as to affect the planner.  (Further, my opinion is to
calculate the cost of this histogram generation and use it in costing in
some way.)

Further, we could put some threshold on when the merge happens
automatically: only when the temporary histograms reach a set threshold
would they be merged with the older histograms.

Please pass on your inputs.

Regards,
Chetan


On Wed, Dec 30, 2009 at 8:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josh Berkus <josh@agliodbs.com> writes:
> My thoughts on dealing with this intelligently without a major change to
> statistics gathering went along these lines:

> 1. add columns to pg_statistic to hold estimates of upper and lower
> bounds growth between analyzes.

This seems like a fundamentally broken approach, first because "time
between analyzes" is not even approximately a constant, and second
because it assumes that we have a distance metric for all datatypes.
(Note that convert_to_scalar does not assume that it can measure
arbitrary distances, but only fractions *within* a histogram bucket;
and even that is pretty shaky.)

I don't have a better idea at the moment :-(

                       regards, tom lane


Re: Thoughts on statistics for continuously advancing columns

From
Robert Haas
Date:
On Tue, Jan 5, 2010 at 7:49 AM, Chetan Suttraway
<chetan.suttraway@enterprisedb.com> wrote:
> if we can treat this case as similar to that of merging of histograms in
> case of joins involving 2 tables and generating the histograms for the
> cartesian (result) node,

...which you can't, because it's totally different, so I think the
rest of this is a dead end.

...Robert