Thread: Unused index influencing sequential scan plan
Hi all, I've created a test table containing 21 million random dates and times, but I get wildly different results when I introduce a functional index then ANALYSE again, even though it doesn't use the index: postgres=# CREATE TABLE test (id serial, sampledate timestamp); CREATE TABLE postgres=# INSERT INTO test (sampledate) SELECT '1970-01-01 00:00:00'::timestamp + (random()*1350561945 || ' seconds')::interval FROM generate_series(1,21000000); INSERT 0 21000000 postgres=# VACUUM; VACUUM postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) FROM test GROUP BY extract(month FROM sampledate); QUERY PLAN ---------------------------------------------------------------------- HashAggregate (cost=481014.00..481016.50 rows=200 width=8) -> Seq Scan on test (cost=0.00..376014.00 rows=21000000 width=8) (2 rows) postgres=# ANALYSE; ANALYZE postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) FROM test GROUP BY extract(month FROM sampledate); QUERY PLAN ---------------------------------------------------------------------------- GroupAggregate (cost=4078473.42..4498473.90 rows=21000024 width=8) -> Sort (cost=4078473.42..4130973.48 rows=21000024 width=8) Sort Key: (date_part('month'::text, sampledate)) -> Seq Scan on test (cost=0.00..376014.30 rows=21000024 width=8) (4 rows) postgres=# CREATE INDEX idx_test_sampledate_month ON test (extract(month FROM sampledate)); CREATE INDEX postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) FROM test GROUP BY extract(month FROM sampledate); QUERY PLAN ---------------------------------------------------------------------------- GroupAggregate (cost=4078470.03..4498470.03 rows=21000000 width=8) -> Sort (cost=4078470.03..4130970.03 rows=21000000 width=8) Sort Key: (date_part('month'::text, sampledate)) -> Seq Scan on test (cost=0.00..376014.00 rows=21000000 width=8) (4 rows) postgres=# ANALYSE; ANALYZE postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) FROM test GROUP BY extract(month FROM sampledate); QUERY PLAN ---------------------------------------------------------------------- HashAggregate (cost=481012.85..481013.00 rows=12 width=8) -> Seq Scan on test (cost=0.00..376013.17 rows=20999934 width=8) (2 rows) The estimate is down to almost a 10th of what it was before. What's going on? And as a side note, how come it's impossible to get the planner to use an index-only scan to satisfy the query (disabling sequential and regular index scans)? -- Thom
On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote: > Hi all, > > I've created a test table containing 21 million random dates and > times, but I get wildly different results when I introduce a > functional index then ANALYSE again, even though it doesn't use the > index: > > postgres=# CREATE TABLE test (id serial, sampledate timestamp); > CREATE TABLE > postgres=# INSERT INTO test (sampledate) SELECT '1970-01-01 > 00:00:00'::timestamp + (random()*1350561945 || ' seconds')::interval > FROM generate_series(1,21000000); > INSERT 0 21000000 > postgres=# VACUUM; > VACUUM > postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) > FROM test GROUP BY extract(month FROM sampledate); > QUERY PLAN > ---------------------------------------------------------------------- > HashAggregate (cost=481014.00..481016.50 rows=200 width=8) > -> Seq Scan on test (cost=0.00..376014.00 rows=21000000 width=8) > (2 rows) > > postgres=# ANALYSE; > ANALYZE > postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) > FROM test GROUP BY extract(month FROM sampledate); > QUERY PLAN > ---------------------------------------------------------------------------- > GroupAggregate (cost=4078473.42..4498473.90 rows=21000024 width=8) > -> Sort (cost=4078473.42..4130973.48 rows=21000024 width=8) > Sort Key: (date_part('month'::text, sampledate)) > -> Seq Scan on test (cost=0.00..376014.30 rows=21000024 width=8) > (4 rows) > > postgres=# CREATE INDEX idx_test_sampledate_month ON test > (extract(month FROM sampledate)); > CREATE INDEX > postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) > FROM test GROUP BY extract(month FROM sampledate); > QUERY PLAN > ---------------------------------------------------------------------------- > GroupAggregate (cost=4078470.03..4498470.03 rows=21000000 width=8) > -> Sort (cost=4078470.03..4130970.03 rows=21000000 width=8) > Sort Key: (date_part('month'::text, sampledate)) > -> Seq Scan on test (cost=0.00..376014.00 rows=21000000 width=8) > (4 rows) > > postgres=# ANALYSE; > ANALYZE > postgres=# EXPLAIN SELECT extract(month FROM sampledate), count(*) > FROM test GROUP BY extract(month FROM sampledate); > QUERY PLAN > ---------------------------------------------------------------------- > HashAggregate (cost=481012.85..481013.00 rows=12 width=8) > -> Seq Scan on test (cost=0.00..376013.17 rows=20999934 width=8) > (2 rows) > > > The estimate is down to almost a 10th of what it was before. What's going on? > > And as a side note, how come it's impossible to get the planner to use > an index-only scan to satisfy the query (disabling sequential and > regular index scans)? I should perhaps mention this is on 9.3devel as of today. -- Thom
On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote: > The estimate is down to almost a 10th of what it was before. What's going on? Even though the index isn't used, the pg_statistic entries that the expression index would have made available are. It's as if you materialised the expression into a column, analyzed and grouped by that. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Thom Brown <thom@linux.com> writes: > I've created a test table containing 21 million random dates and > times, but I get wildly different results when I introduce a > functional index then ANALYSE again, even though it doesn't use the > index: As Peter said, the existence of the index causes ANALYZE to gather stats about the expression, which will affect rowcount estimates whether or not the planner chooses to use the index. > And as a side note, how come it's impossible to get the planner to use > an index-only scan to satisfy the query (disabling sequential and > regular index scans)? Implementation restriction - we don't yet have a way to match index-only scans to expressions. regards, tom lane
On 18 October 2012 17:24, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 18 October 2012 17:11, Thom Brown <thom@linux.com> wrote: >> The estimate is down to almost a 10th of what it was before. What's going on? > > Even though the index isn't used, the pg_statistic entries that the > expression index would have made available are. It's as if you > materialised the expression into a column, analyzed and grouped by > that. D'oh, of course! Thanks Peter. -- Thom
On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thom Brown <thom@linux.com> writes: >> And as a side note, how come it's impossible to get the planner to use >> an index-only scan to satisfy the query (disabling sequential and >> regular index scans)? > > Implementation restriction - we don't yet have a way to match index-only > scans to expressions. Ah, I suspected it might be, but couldn't find notes on what scenarios it's yet to be able to work in. Thanks. -- Thom
Thom Brown <thom@linux.com> writes: > On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Thom Brown <thom@linux.com> writes: >>> And as a side note, how come it's impossible to get the planner to use >>> an index-only scan to satisfy the query (disabling sequential and >>> regular index scans)? >> Implementation restriction - we don't yet have a way to match index-only >> scans to expressions. > Ah, I suspected it might be, but couldn't find notes on what scenarios > it's yet to be able to work in. Thanks. I forgot to mention that there is a klugy workaround: add the required variable(s) as extra index columns. That is, create index i on t (foo(x), x); The planner isn't terribly bright about this, but it will use that index for a query that only requires foo(x), and it won't re-evaluate foo() (though I think it will cost the plan on the assumption it does :-(). regards, tom lane
On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I forgot to mention that there is a klugy workaround: add the required > variable(s) as extra index columns. That is, > > create index i on t (foo(x), x); Is there a case to be made for a index access method whose pseudo-indexes costs essentially nothing to maintain, and simply represent an ongoing obligation for ANALYZE to provide statistics for an expression? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thom Brown <thom@linux.com> writes: >> On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Thom Brown <thom@linux.com> writes: >>>> And as a side note, how come it's impossible to get the planner to use >>>> an index-only scan to satisfy the query (disabling sequential and >>>> regular index scans)? > >>> Implementation restriction - we don't yet have a way to match index-only >>> scans to expressions. > >> Ah, I suspected it might be, but couldn't find notes on what scenarios >> it's yet to be able to work in. Thanks. > > I forgot to mention that there is a klugy workaround: add the required > variable(s) as extra index columns. That is, > > create index i on t (foo(x), x); > > The planner isn't terribly bright about this, but it will use that index > for a query that only requires foo(x), and it won't re-evaluate foo() > (though I think it will cost the plan on the assumption it does :-(). Ah, yes, I've tested this and got it using an index-only scan, and it was faster than than the sequential scan (index only scan 5024.545 ms vs seq scan 6627.072 ms). So this is probably a dumb question, but is it possible to achieve the optimisation provided by index statistics but without the index, and without a messy workaround using a supplementary column which stores function-derived values? If not, is that something which can be introduced? -- Thom
On 18 October 2012 18:00, Peter Geoghegan <peter@2ndquadrant.com> wrote: > On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I forgot to mention that there is a klugy workaround: add the required >> variable(s) as extra index columns. That is, >> >> create index i on t (foo(x), x); > > Is there a case to be made for a index access method whose > pseudo-indexes costs essentially nothing to maintain, and simply > represent an ongoing obligation for ANALYZE to provide statistics for > an expression? Heh, that's pretty much the question I posted just a few seconds ago. -- Thom
Peter Geoghegan <peter@2ndquadrant.com> writes: > Is there a case to be made for a index access method whose > pseudo-indexes costs essentially nothing to maintain, and simply > represent an ongoing obligation for ANALYZE to provide statistics for > an expression? If we were going to support it, I think we'd be better off exposing such a feature as DDL having nothing to do with indexes. Not sure it's worth the trouble though. The ANALYZE wart to compute stats for index expressions has been there a long time, and there's been essentially zero field demand for another way to do it. What people really seem to care about is more intelligence about making use of expression indexes to avoid recalculation of the expression --- something you'd not get from a stats-only feature. regards, tom lane
On Thu, 18 Oct 2012, 18:01 Thom Brown, <thom@linux.com> wrote:
On 18 October 2012 17:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> On 18 October 2012 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thom Brown <thom@linux.com> writes:
>>>> And as a side note, how come it's impossible to get the planner to use
>>>> an index-only scan to satisfy the query (disabling sequential and
>>>> regular index scans)?
>
>>> Implementation restriction - we don't yet have a way to match index-only
>>> scans to expressions.
>
>> Ah, I suspected it might be, but couldn't find notes on what scenarios
>> it's yet to be able to work in. Thanks.
>
> I forgot to mention that there is a klugy workaround: add the required
> variable(s) as extra index columns. That is,
>
> create index i on t (foo(x), x);
>
> The planner isn't terribly bright about this, but it will use that index
> for a query that only requires foo(x), and it won't re-evaluate foo()
> (though I think it will cost the plan on the assumption it does :-().
Ah, yes, I've tested this and got it using an index-only scan, and it
was faster than than the sequential scan (index only scan 5024.545 ms
vs seq scan 6627.072 ms).
So this is probably a dumb question, but is it possible to achieve the
optimisation provided by index statistics but without the index, and
without a messy workaround using a supplementary column which stores
function-derived values? If not, is that something which can be
introduced?
A very late thanks for extended statistics, Tomas.
Thom