Thread: strange pg_stats behaviour?

strange pg_stats behaviour?

From

Hubert depesz Lubaczewski

Date:

29 November 2002, 19:12:21

hi
i have table which contains number of field. one of them is:
data_off      | timestamp with time zone |
with index:
hdl_auction_data_off btree (data_off)

statistics of this table are:
# select count(*) from auction;
 count
-------
 98792
(1 row)

# select count(distinct data_off) from auction;
 count
-------
 98558
(1 row)

now i want a query which will show me all records with old data_off:
SELECT aid, foto FROM auction WHERE data_off < now()::timestamptz - '31 days'::interval;

explain analyze is:
# explain analyze SELECT aid, foto FROM auction WHERE data_off < now() -
# '31 days'::interval;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on auction  (cost=0.00..14985.38 rows=9493 width=11) (actual
time=21.33..1252.29 rows=1293 loops=1)
   Filter: (data_off < (now() - '31 days'::interval))
 Total runtime: 1253.61 msec
(3 rows)

of course estimated cost and rows are worth nothing. with enable_seqscan
false, i get total runtime below 30 msec!.
i tried to alter table auction alter column data_off set statistics 100,
200, and finally 1000.
of course i did analyze and reconnect.
no change. always very slow, seq scan. any idea on what is wrong?
right now information from pg_stats look like:
# select * from pg_stats where tablename='auction' and attname='data_off';
 schemaname | tablename | attname  | null_frac | avg_width | n_distinct |         most_common_vals          |
most_common_freqs|
                                                                  histogram_bounds

          | correlation  

------------+-----------+----------+-----------+-----------+------------+-----------------------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | auction   | data_off |         0 |         8 |  -0.975561 | {"2002-11-19 12:53:58.540808+01"} |
{0.00266667}     | {"2002-10-29 13:52:44.757049+01","2002-11-01 19:36:40.123367+01","2002-11-04
21:59:39.391224+01","2002-11-1003:56:53.322752+01","2002-11-14 08:00:27.357274+01","2002-11-18
20:55:26.336502+01","2002-11-2209:44:31.660568+01","2002-11-26 10:40:08.078131+01","2002-11-30
17:12:04.360514+01","2002-12-0508:37:02.211342+01","2002-12-20 00:44:18.810695+01"} |    0.196655 
(1 row)

i'm testing it on postgresql 7.4devel, but on production server (7.2.2
if i recall correctly) it behaves exactly the same way.

i read all i could about optimising, and so on, but i can't figure out
what is wrong here. what am i missing?

depesz

--
hubert depesz lubaczewski                          http://www.depesz.pl/
------------------------------------------------------------------------
Mój Boże, spraw abym milczał, dopóki się nie upewnię,  że  naprawdę  mam
coś do powiedzenia.                                      (c) 1998 depesz

Re: strange pg_stats behaviour?

From

Tom Lane

Date:

29 November 2002, 20:04:24

Hubert depesz Lubaczewski <depesz@depesz.pl> writes:
> # explain analyze SELECT aid, foto FROM auction WHERE data_off < now() -
> # '31 days'::interval;
>                                                   QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>  Seq Scan on auction  (cost=0.00..14985.38 rows=9493 width=11) (actual
> time=21.33..1252.29 rows=1293 loops=1)
>    Filter: (data_off < (now() - '31 days'::interval))
>  Total runtime: 1253.61 msec
> (3 rows)

The planner doesn't know what value data_off will be compared to at
runtime, so it has to fall back on a default selectivity estimate.
Increasing the amount of stats data won't help in the slightest.

A cheat I've occasionally suggested for this is to define a function
like

    create function ago(interval) returns timestamptz as
    'select now() - $1' language sql immutable strict;

Then an expression like "WHERE data_off < ago('31 days')" will be
indexable because the ago() expression will be constant-folded at
the start of planning.  However, this is a cheat because ago() is
*not* really immutable --- you will likely get burnt if you try to use
this technique for queries inside plpgsql functions, for example.

I don't know a good way to solve this problem in the general case.
I'm not willing to make the default selectivity estimate for a one-sided
inequality be low enough to provoke an indexscan; that's just asking for
trouble, because the query could easily be fetching much or all of the
table.

Another workaround that you could look at is

SELECT ...
WHERE data_off <  now() - '31 days'::interval AND
      data_off > '-infinity';

The extra clause doesn't hurt your results, and the default selectivity
estimate for a range-bounded query *is* small enough to provoke an
indexscan (in most cases, anyway).

            regards, tom lane

Re: strange pg_stats behaviour?

From

Hubert depesz Lubaczewski

Date:

30 November 2002, 03:13:59

On Fri, Nov 29, 2002 at 08:04:15PM -0500, Tom Lane wrote:
> Hubert depesz Lubaczewski <depesz@depesz.pl> writes:
> > # explain analyze SELECT aid, foto FROM auction WHERE data_off < now() -
> > # '31 days'::interval;

first, thanks for quick reply.

> The planner doesn't know what value data_off will be compared to at
> runtime, so it has to fall back on a default selectivity estimate.

strange - is there no way we can "teach" him, that this is a constant. i
mean now() - '31 days'::interval will be exactly the same for all rows
of auction, and i belive we could put functionality info planner to
"think" about it - right side of my condition doesn't contain any call
to table fields, so it probably should be treated as constant value. or
am i totally wrong?

> Another workaround that you could look at is
> SELECT ...
> WHERE data_off <  now() - '31 days'::interval AND
>       data_off > '-infinity';

this query looks quite strange, but if it's going to help, then i think
i might use it. right now i just modify enable_seqscan before and after
my query, but this is definitelly not nice.

depesz

--
hubert depesz lubaczewski                          http://www.depesz.pl/
------------------------------------------------------------------------
Mój Boże, spraw abym milczał, dopóki się nie upewnię,  że  naprawdę  mam
coś do powiedzenia.                                      (c) 1998 depesz

Re: strange pg_stats behaviour?

From

Tom Lane

Date:

30 November 2002, 11:05:10

Hubert depesz Lubaczewski <depesz@depesz.pl> writes:
>> The planner doesn't know what value data_off will be compared to at
>> runtime, so it has to fall back on a default selectivity estimate.

> strange - is there no way we can "teach" him, that this is a constant. i
> mean now() - '31 days'::interval will be exactly the same for all rows
> of auction,

It knows that.  What it doesn't know is how many rows of the table the
query is going to select, and that is exactly the thing it needs to know
to choose seqscan vs. indexscan.

            regards, tom lane

Re: strange pg_stats behaviour?

From

Hubert depesz Lubaczewski

Date:

30 November 2002, 13:58:22

On Sat, Nov 30, 2002 at 11:05:01AM -0500, Tom Lane wrote:
> >> The planner doesn't know what value data_off will be compared to at
> >> runtime, so it has to fall back on a default selectivity estimate.
> > strange - is there no way we can "teach" him, that this is a constant. i
> > mean now() - '31 days'::interval will be exactly the same for all rows
> > of auction,
> It knows that.  What it doesn't know is how many rows of the table the
> query is going to select, and that is exactly the thing it needs to know
> to choose seqscan vs. indexscan.

than what purpose serves pg_stats? i belive it could be used to estimate
number of rows returned.
if i know that i will be comparing <field> with some <value>, and i'm
able to tell the value, i should be able to look in pg_stats to estimate
rowcount.
what am i missing?

depesz

--
hubert depesz lubaczewski                          http://www.depesz.pl/
------------------------------------------------------------------------
Mój Boże, spraw abym milczał, dopóki się nie upewnię,  że  naprawdę  mam
coś do powiedzenia.                                      (c) 1998 depesz

Re: strange pg_stats behaviour?

From

Tom Lane

Date:

30 November 2002, 16:27:00

Hubert depesz Lubaczewski <depesz@depesz.pl> writes:
> what am i missing?

"constant at run time" and "constant at plan time" are not the same.

            regards, tom lane