Thread: column without pg_stats entry?!

column without pg_stats entry?!

From
Bernd Heller
Date:
Hello everyone,

I'm having a problem with some of my tables and I'm not sure if
postgres' behaviour is maybe even a bug. I'm (still) using 8.0rc5 at
present.

I have a table that contains among other columns one of the sort:
    purge_date timestamp

most records will have this field set to NULL, at present all of them
really. the table has about 100k row right now. in regular intervals
I'm doing some cleanup on this table using a query like:
    delete from mytable where purge_date is not null and purge_date <
current_date

And I have created these btree indexes:
    create index on mytable (purge_date);
    create index on mytable (purge_date) where purge_date is not null;

my problem is that the planner always chooses a seq scan over an index
scan. only when I set enable_seqscan to false does it use an index
scan. The costs of both plans are extremely different, with the index
scan being 5-10 times more expensive than the seq scan, which is
obviously not true given that all rows have this column set to NULL.

I wondered why the planner was making such bad assumptions about the
number of rows to find and had a look at pg_stats. and there was the
surprise:
there is no entry in pg_stats for that column at all!! I can only
suspect that this has to do with the column being all null. I tried to
change a few records to a not-null value, but re-ANALYZE didn't catch
them apparently.

Is this desired behaviour for analyze? Can I change it somehow? If not,
is there a better way to accomplish what I'm trying? I'm not to keen on
disabling seqscan for that query explicitly. It's a simple enough query
and the planner should be able to find the right plan without help -
and I'm sure it would if it had stats about it.

Any help appreciated.

Bernd


Re: column without pg_stats entry?!

From
Bruno Wolff III
Date:
On Thu, Jan 20, 2005 at 11:14:28 +0100,
  Bernd Heller <bdheller@users.sourceforge.net> wrote:
>
> I wondered why the planner was making such bad assumptions about the
> number of rows to find and had a look at pg_stats. and there was the
> surprise:
> there is no entry in pg_stats for that column at all!! I can only
> suspect that this has to do with the column being all null. I tried to
> change a few records to a not-null value, but re-ANALYZE didn't catch
> them apparently.

Someone else reported this recently and I think it is going to be fixed.

> Is this desired behaviour for analyze? Can I change it somehow? If not,
> is there a better way to accomplish what I'm trying? I'm not to keen on
> disabling seqscan for that query explicitly. It's a simple enough query
> and the planner should be able to find the right plan without help -
> and I'm sure it would if it had stats about it.

In the short run you could add an IS NOT NULL clause to your query.
The optimizer doesn't know that < being TRUE implies IS NOT NULL and
so the partial index won't be used unless you add that clause explicitly.

Re: column without pg_stats entry?!

From
Tom Lane
Date:
Bruno Wolff III <bruno@wolff.to> writes:
>   Bernd Heller <bdheller@users.sourceforge.net> wrote:
>> there is no entry in pg_stats for that column at all!! I can only
>> suspect that this has to do with the column being all null.

> Someone else reported this recently and I think it is going to be fixed.

Yeah, this was griped of a little bit ago, but I felt it was too close
to 8.0 release to risk fooling with for this cycle.

> In the short run you could add an IS NOT NULL clause to your query.
> The optimizer doesn't know that < being TRUE implies IS NOT NULL and
> so the partial index won't be used unless you add that clause explicitly.

Actually, as of 8.0 the optimizer *does* know that.  I'm a bit surprised
that it didn't pick the partial index, since even without any analyze
stats, the small physical size of the partial index should have clued it
that there weren't many such tuples.  Could we see EXPLAIN output for
both cases (both settings of enable_seqscan)?

            regards, tom lane

Re: column without pg_stats entry?!

From
Bernd Heller
Date:
Ah no, I think both of you have mistaken me. The problem here is not
about partial indexes (not really anyway).
I do have a partial index with "WHERE purge_date IS NOT NULL", and my
query does contain "WHERE purge_date IS NOT NULL" as well. The problem
here is, that all rows (or almost all) have the column purge_date set
to NULL. The planner expected the query to return 33% of all rows in
the table. So it made the seq scan MUCH cheaper, which was right in the
planner's way of thinking because it didn't know anything about the
column from pg_stats.

I had a look at the source code of the analyze command meanwhile:
the compute_*_stats functions don't return valid statistics if they
can't find any non-null values, and as a result no statistics tuple for
that column is created in pg_stats. I think this is wrong. Not finding
any non-null values IS a very useful information, it means a
null-fraction of 100%. I have patched my postgres to return valid
statistics even in that case (patch below).
The difference now is that the planner doesn't assume anymore it would
get about 33% of rows back, instead it knows that the null-fraction of
that column is approximately 1.0 and it chooses the index scan because
that is now the by far cheapest plan.

--- analyze.c   Thu Jan 20 11:37:58 2005
+++ analyze.c.orig      Sun Nov 14 03:04:13 2004
@@ -1704,9 +1704,6 @@
                         stats->stavalues[0] = mcv_values;
                         stats->numvalues[0] = num_mcv;
                 }
-       } else {
-               stats->stats_valid = true;
-               stats->stanullfrac = 1.0;
         }

         /* We don't need to bother cleaning up any of our temporary
palloc's */
@@ -2164,9 +2161,6 @@
                         stats->numnumbers[slot_idx] = 1;
                         slot_idx++;
                 }
-       } else {
-               stats->stats_valid = true;
-               stats->stanullfrac = 1.0;
         }

         /* We don't need to bother cleaning up any of our temporary
palloc's */


On 21.01.2005, at 7:02 Uhr, Tom Lane wrote:

> Bruno Wolff III <bruno@wolff.to> writes:
>>   Bernd Heller <bdheller@users.sourceforge.net> wrote:
>>> there is no entry in pg_stats for that column at all!! I can only
>>> suspect that this has to do with the column being all null.
>
>> Someone else reported this recently and I think it is going to be
>> fixed.
>
> Yeah, this was griped of a little bit ago, but I felt it was too close
> to 8.0 release to risk fooling with for this cycle.
>
>> In the short run you could add an IS NOT NULL clause to your query.
>> The optimizer doesn't know that < being TRUE implies IS NOT NULL and
>> so the partial index won't be used unless you add that clause
>> explicitly.
>
> Actually, as of 8.0 the optimizer *does* know that.  I'm a bit
> surprised
> that it didn't pick the partial index, since even without any analyze
> stats, the small physical size of the partial index should have clued
> it
> that there weren't many such tuples.  Could we see EXPLAIN output for
> both cases (both settings of enable_seqscan)?
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster