Thread: Unused index influencing sequential scan plan

Unused index influencing sequential scan plan

From
Thom Brown
Date:
Hi all,

I've created a test table containing 21 million random dates and
times, but I get wildly different results when I introduce a
functional index then ANALYSE again, even though it doesn't use the
index:

postgres=# CREATE TABLE test (id serial, sampledate timestamp);
CREATE TABLE
postgres=# INSERT INTO test (sampledate) SELECT '1970-01-01
00:00:00'::timestamp + (random()*1350561945 || ' seconds')::interval
FROM generate_series(1,21000000);
INSERT 0 21000000
postgres=# VACUUM;
VACUUM
postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
FROM test GROUP BY extract(month FROM sampledate);
                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=481014.00..481016.50 rows=200 width=8)
   ->  Seq Scan on test  (cost=0.00..376014.00 rows=21000000 width=8)
(2 rows)

postgres=# ANALYSE;
ANALYZE
postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
FROM test GROUP BY extract(month FROM sampledate);
                                 QUERY PLAN
----------------------------------------------------------------------------
 GroupAggregate  (cost=4078473.42..4498473.90 rows=21000024 width=8)
   ->  Sort  (cost=4078473.42..4130973.48 rows=21000024 width=8)
         Sort Key: (date_part('month'::text, sampledate))
         ->  Seq Scan on test  (cost=0.00..376014.30 rows=21000024 width=8)
(4 rows)

postgres=# CREATE INDEX idx_test_sampledate_month ON test
(extract(month FROM sampledate));
CREATE INDEX
postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
FROM test GROUP BY extract(month FROM sampledate);
                                 QUERY PLAN
----------------------------------------------------------------------------
 GroupAggregate  (cost=4078470.03..4498470.03 rows=21000000 width=8)
   ->  Sort  (cost=4078470.03..4130970.03 rows=21000000 width=8)
         Sort Key: (date_part('month'::text, sampledate))
         ->  Seq Scan on test  (cost=0.00..376014.00 rows=21000000 width=8)
(4 rows)

postgres=# ANALYSE;
ANALYZE
postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
FROM test GROUP BY extract(month FROM sampledate);
                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=481012.85..481013.00 rows=12 width=8)
   ->  Seq Scan on test  (cost=0.00..376013.17 rows=20999934 width=8)
(2 rows)


The estimate is down to almost a 10th of what it was before.  What's going on?

And as a side note, how come it's impossible to get the planner to use
an index-only scan to satisfy the query (disabling sequential and
regular index scans)?

--
Thom


Re: Unused index influencing sequential scan plan

From
Thom Brown
Date:
On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote:
> Hi all,
>
> I've created a test table containing 21 million random dates and
> times, but I get wildly different results when I introduce a
> functional index then ANALYSE again, even though it doesn't use the
> index:
>
> postgres=# CREATE TABLE test (id serial, sampledate timestamp);
> CREATE TABLE
> postgres=# INSERT INTO test (sampledate) SELECT '1970-01-01
> 00:00:00'::timestamp + (random()*1350561945 || ' seconds')::interval
> FROM generate_series(1,21000000);
> INSERT 0 21000000
> postgres=# VACUUM;
> VACUUM
> postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
> FROM test GROUP BY extract(month FROM sampledate);
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=481014.00..481016.50 rows=200 width=8)
>    ->  Seq Scan on test  (cost=0.00..376014.00 rows=21000000 width=8)
> (2 rows)
>
> postgres=# ANALYSE;
> ANALYZE
> postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
> FROM test GROUP BY extract(month FROM sampledate);
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  GroupAggregate  (cost=4078473.42..4498473.90 rows=21000024 width=8)
>    ->  Sort  (cost=4078473.42..4130973.48 rows=21000024 width=8)
>          Sort Key: (date_part('month'::text, sampledate))
>          ->  Seq Scan on test  (cost=0.00..376014.30 rows=21000024 width=8)
> (4 rows)
>
> postgres=# CREATE INDEX idx_test_sampledate_month ON test
> (extract(month FROM sampledate));
> CREATE INDEX
> postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
> FROM test GROUP BY extract(month FROM sampledate);
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  GroupAggregate  (cost=4078470.03..4498470.03 rows=21000000 width=8)
>    ->  Sort  (cost=4078470.03..4130970.03 rows=21000000 width=8)
>          Sort Key: (date_part('month'::text, sampledate))
>          ->  Seq Scan on test  (cost=0.00..376014.00 rows=21000000 width=8)
> (4 rows)
>
> postgres=# ANALYSE;
> ANALYZE
> postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*)
> FROM test GROUP BY extract(month FROM sampledate);
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=481012.85..481013.00 rows=12 width=8)
>    ->  Seq Scan on test  (cost=0.00..376013.17 rows=20999934 width=8)
> (2 rows)
>
>
> The estimate is down to almost a 10th of what it was before.  What's going on?
>
> And as a side note, how come it's impossible to get the planner to use
> an index-only scan to satisfy the query (disabling sequential and
> regular index scans)?

I should perhaps mention this is on 9.3devel as of today.

--
Thom


Re: Unused index influencing sequential scan plan

From
Peter Geoghegan
Date:
On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote:
> The estimate is down to almost a 10th of what it was before.  What's going on?

Even though the index isn't used, the pg_statistic entries that the
expression index would have made available are. It's as if you
materialised the expression into a column, analyzed and grouped by
that.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Unused index influencing sequential scan plan

From
Tom Lane
Date:
Thom Brown <thom@linux.com> writes:
> I've created a test table containing 21 million random dates and
> times, but I get wildly different results when I introduce a
> functional index then ANALYSE again, even though it doesn't use the
> index:

As Peter said, the existence of the index causes ANALYZE to gather stats
about the expression, which will affect rowcount estimates whether or
not the planner chooses to use the index.

> And as a side note, how come it's impossible to get the planner to use
> an index-only scan to satisfy the query (disabling sequential and
> regular index scans)?

Implementation restriction - we don't yet have a way to match index-only
scans to expressions.

            regards, tom lane


Re: Unused index influencing sequential scan plan

From
Thom Brown
Date:
On 18 October 2012 17:24, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote:
>> The estimate is down to almost a 10th of what it was before.  What's going on?
>
> Even though the index isn't used, the pg_statistic entries that the
> expression index would have made available are. It's as if you
> materialised the expression into a column, analyzed and grouped by
> that.

D'oh, of course!  Thanks Peter.

--
Thom


Re: Unused index influencing sequential scan plan

From
Thom Brown
Date:
On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> And as a side note, how come it's impossible to get the planner to use
>> an index-only scan to satisfy the query (disabling sequential and
>> regular index scans)?
>
> Implementation restriction - we don't yet have a way to match index-only
> scans to expressions.

Ah, I suspected it might be, but couldn't find notes on what scenarios
it's yet to be able to work in.  Thanks.

--
Thom


Re: Unused index influencing sequential scan plan

From
Tom Lane
Date:
Thom Brown <thom@linux.com> writes:
> On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thom Brown <thom@linux.com> writes:
>>> And as a side note, how come it's impossible to get the planner to use
>>> an index-only scan to satisfy the query (disabling sequential and
>>> regular index scans)?

>> Implementation restriction - we don't yet have a way to match index-only
>> scans to expressions.

> Ah, I suspected it might be, but couldn't find notes on what scenarios
> it's yet to be able to work in.  Thanks.

I forgot to mention that there is a klugy workaround: add the required
variable(s) as extra index columns.  That is,

    create index i on t (foo(x), x);

The planner isn't terribly bright about this, but it will use that index
for a query that only requires foo(x), and it won't re-evaluate foo()
(though I think it will cost the plan on the assumption it does :-().

            regards, tom lane


Re: Unused index influencing sequential scan plan

From
Peter Geoghegan
Date:
On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I forgot to mention that there is a klugy workaround: add the required
> variable(s) as extra index columns.  That is,
>
>         create index i on t (foo(x), x);

Is there a case to be made for a index access method whose
pseudo-indexes costs essentially nothing to maintain, and simply
represent an ongoing obligation for ANALYZE to provide statistics for
an expression?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: Unused index influencing sequential scan plan

From
Thom Brown
Date:
On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thom Brown <thom@linux.com> writes:
>>>> And as a side note, how come it's impossible to get the planner to use
>>>> an index-only scan to satisfy the query (disabling sequential and
>>>> regular index scans)?
>
>>> Implementation restriction - we don't yet have a way to match index-only
>>> scans to expressions.
>
>> Ah, I suspected it might be, but couldn't find notes on what scenarios
>> it's yet to be able to work in.  Thanks.
>
> I forgot to mention that there is a klugy workaround: add the required
> variable(s) as extra index columns.  That is,
>
>         create index i on t (foo(x), x);
>
> The planner isn't terribly bright about this, but it will use that index
> for a query that only requires foo(x), and it won't re-evaluate foo()
> (though I think it will cost the plan on the assumption it does :-().

Ah, yes, I've tested this and got it using an index-only scan, and it
was faster than than the sequential scan (index only scan 5024.545 ms
vs seq scan 6627.072 ms).

So this is probably a dumb question, but is it possible to achieve the
optimisation provided by index statistics but without the index, and
without a messy workaround using a supplementary column which stores
function-derived values?  If not, is that something which can be
introduced?
--
Thom


Re: Unused index influencing sequential scan plan

From
Thom Brown
Date:
On 18 October 2012 18:00, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I forgot to mention that there is a klugy workaround: add the required
>> variable(s) as extra index columns.  That is,
>>
>>         create index i on t (foo(x), x);
>
> Is there a case to be made for a index access method whose
> pseudo-indexes costs essentially nothing to maintain, and simply
> represent an ongoing obligation for ANALYZE to provide statistics for
> an expression?

Heh, that's pretty much the question I posted just a few seconds ago.
--
Thom


Re: Unused index influencing sequential scan plan

From
Tom Lane
Date:
Peter Geoghegan <peter@2ndquadrant.com> writes:
> Is there a case to be made for a index access method whose
> pseudo-indexes costs essentially nothing to maintain, and simply
> represent an ongoing obligation for ANALYZE to provide statistics for
> an expression?

If we were going to support it, I think we'd be better off exposing such
a feature as DDL having nothing to do with indexes.  Not sure it's worth
the trouble though.  The ANALYZE wart to compute stats for index
expressions has been there a long time, and there's been essentially
zero field demand for another way to do it.  What people really seem to
care about is more intelligence about making use of expression indexes
to avoid recalculation of the expression --- something you'd not get
from a stats-only feature.

            regards, tom lane