Thread: extract(year from date) doesn't use index but maybe could?

extract(year from date) doesn't use index but maybe could?

From
Jon Dufresne
Date:
Given the table:

CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)

With an *index* on field d. The following two queries are functionally
equivalent:

1. SELECT * FROM dates WHERE d >= '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'

By functionally equivalent, they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.

Some database abstraction layers have attempted to workaround this
limitation by rewriting EXTRACT(year ...) queries into a query more
like query 1. For example: Django's ORM does exctly this. Rather than
all abstraction layers trying to optimize this case, maybe it could be
pushed to the database layer.

I have written a test script that demonstrates that these functionally
equivalent queries have different performance characteristics. The
script and results are provide below:

RESULTS:

----
EXPLAIN SELECT * FROM dates WHERE d >= '1900-01-01'
                                 QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on dates  (cost=9819.23..26390.15 rows=524233 width=40)
   Recheck Cond: (d >= '1900-01-01'::date)
   ->  Bitmap Index Scan on d_idx  (cost=0.00..9688.17 rows=524233 width=0)
         Index Cond: (d >= '1900-01-01'::date)
(4 rows)

EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on dates  (cost=0.00..37540.25 rows=524233 width=40)
   Filter: (date_part('year'::text, (d)::timestamp without time zone)
>= 1900::double precision)
(2 rows)

Timing
select_without_extract: 284.233350s
select_with_extract: 323.106491s
----

SCRIPT:

----
#!/usr/bin/python3

import datetime
import subprocess
import random
import timeit
import sys


subprocess.check_call(['psql', 'postgres', '-c', 'DROP DATABASE IF
EXISTS datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'postgres', '-c', 'CREATE DATABASE
datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'datetest', '-c', 'CREATE TABLE dates
(id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)'],
stdout=subprocess.DEVNULL)


def chunks(n, l):
    i = 0
    while i < len(l):
        yield l[i:i+n]
        i += n

d = datetime.date(1800, 1, 1)
today = datetime.date.today()
values = []
while d < today:
    values.extend('(\'%s\', \'%s\')' % (d, d) for i in range(20))
    d += datetime.timedelta(days=1)
random.shuffle(values)
for chunk in chunks(1000, values):
    s = ','.join(chunk)
    subprocess.check_call(['psql', 'datetest', '-c', 'INSERT INTO
dates (d, t) VALUES %s' % s], stdout=subprocess.DEVNULL)


subprocess.check_call(['psql', 'datetest', '-c', 'CREATE INDEX d_idx
ON dates (d)'], stdout=subprocess.DEVNULL)
print('EXPLAIN SELECT * FROM dates WHERE d >= \'1900-01-01\'')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE d >= \'1900-01-01\''])
print('EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE EXTRACT(year from d) >= 1900'])


def select_without_extract():
    subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE d >= \'1900-01-01\''], stdout=subprocess.DEVNULL)

def select_with_extract():
    subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE EXTRACT(year from d) >= 1900'], stdout=subprocess.DEVNULL)

print('Timing')
sys.stdout.flush()

v = timeit.timeit('select_without_extract()', setup='from __main__
import select_without_extract', number=100)
print('select_without_extract: %fs' % v)
sys.stdout.flush()

v = timeit.timeit('select_with_extract()', setup='from __main__ import
select_with_extract', number=100)
print('select_with_extract: %fs' % v)
sys.stdout.flush()
---


Re: extract(year from date) doesn't use index but maybe could?

From
Tomas Vondra
Date:

On 04/19/15 19:16, Jon Dufresne wrote:
> Given the table:
>
> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
>
> With an *index* on field d. The following two queries are functionally
> equivalent:
>
> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'
>
> By functionally equivalent, they will return the same result set.
>
> Query 2 does not use the index, adding a performance cost. It seems
> there is an opportunity for optimization to handle these two queries
> equivalently to take advantage of the index.

Or you might try creating an expression index ...

CREATE INDEX date_year_idx ON dates((extract(year from d)));


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: extract(year from date) doesn't use index but maybe could?

From
Jon Dufresne
Date:
On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 04/19/15 19:16, Jon Dufresne wrote:
>>
>> Given the table:
>>
>> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
>>
>> With an *index* on field d. The following two queries are functionally
>> equivalent:
>>
>> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
>> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'
>>
>> By functionally equivalent, they will return the same result set.
>>
>> Query 2 does not use the index, adding a performance cost. It seems
>> there is an opportunity for optimization to handle these two queries
>> equivalently to take advantage of the index.
>
>
> Or you might try creating an expression index ...
>
> CREATE INDEX date_year_idx ON dates((extract(year from d)));
>

Certainly, but won't this add additional overhead in the form of two
indexes; one for the column and one for the expression?

My point is, why force the user to take these extra steps or add
overhead when the the two queries (or two indexes) are functionally
equivalent. Shouldn't this is an optimization handled by the database
so the user doesn't need to hand optimize these differences?


Re: extract(year from date) doesn't use index but maybe could?

From
Adam Tauno Williams
Date:
On Sun, 2015-04-19 at 13:10 -0700, Jon Dufresne wrote:
> On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > On 04/19/15 19:16, Jon Dufresne wrote:
> >> Given the table:
> >> CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)
> >> With an *index* on field d. The following two queries are functionally
> >> equivalent:
> >> 1. SELECT * FROM dates WHERE d >= '1900-01-01'
> >> 2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'
> >> By functionally equivalent, they will return the same result set.
> >> Query 2 does not use the index, adding a performance cost. It seems
> >> there is an opportunity for optimization to handle these two queries
> >> equivalently to take advantage of the index.
> > Or you might try creating an expression index ..
> > CREATE INDEX date_year_idx ON dates((extract(year from d)));
> Certainly, but won't this add additional overhead in the form of two
> indexes; one for the column and one for the expression?
> My point is, why force the user to take these extra steps or add
> overhead when the the two queries (or two indexes) are functionally
> equivalent.

But they aren't functionally equivalent.  One is an index on a
datetime/date, the other is an index just on the year [a DOUBLE].
Date/datetimes potentially have time zones, integer values do not - in
general time values are an order of magnitude more complicated than
people expect.

> Shouldn't this is an optimization handled by the database
> so the user doesn't need to hand optimize these differences?

Sometimes "d >= '1900-01-01'" and "EXTRACT(year from d) >= 1900" may be
equivalent; but not always.

--
Adam Tauno Williams <mailto:awilliam@whitemice.org> GPG D95ED383
Systems Administrator, Python Developer, LPI / NCLA



Re: extract(year from date) doesn't use index but maybe could?

From
Tomas Vondra
Date:

On 04/19/15 22:10, Jon Dufresne wrote:
> On Sun, Apr 19, 2015 at 10:42 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> Or you might try creating an expression index ...
>>
>> CREATE INDEX date_year_idx ON dates((extract(year from d)));
>>
>
> Certainly, but won't this add additional overhead in the form of two
> indexes; one for the column and one for the expression?

It will, but it probably will be more efficient than poorly performing
queries. Another option is to use the first type of queries with
explicit date ranges, thus making it possible to use a single index.

>
> My point is, why force the user to take these extra steps or add
> overhead when the the two queries (or two indexes) are functionally
> equivalent. Shouldn't this is an optimization handled by the
> database so the user doesn't need to hand optimize these differences?

Theoretically yes.

But currently the "extract" function call is pretty much a black box for
the planner, just like any other function - it has no idea what happens
inside, what fields are extracted and so on. It certainly is unable to
infer the date range as you propose.

It's possible that in the future someone will implement an optimization
like this, but I'm not aware of anyone working on that and I wouldn't
hold my breath.

Until then you either have to create an expression index, or use queries
with explicit date ranges (without "extract" calls).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: extract(year from date) doesn't use index but maybe could?

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 04/19/15 22:10, Jon Dufresne wrote:
>> My point is, why force the user to take these extra steps or add
>> overhead when the the two queries (or two indexes) are functionally
>> equivalent. Shouldn't this is an optimization handled by the
>> database so the user doesn't need to hand optimize these differences?

> Theoretically yes.

> But currently the "extract" function call is pretty much a black box for
> the planner, just like any other function - it has no idea what happens
> inside, what fields are extracted and so on. It certainly is unable to
> infer the date range as you propose.

> It's possible that in the future someone will implement an optimization
> like this, but I'm not aware of anyone working on that and I wouldn't
> hold my breath.

Yeah.  In principle you could make the planner do this.  As Adam Williams
notes nearby, there's a problem with lack of exact consistency between
extract() semantics and straight timestamp comparisons; but you could
handle that by extracting indexable expressions that are considered lossy,
much as we do with anchored LIKE and regexp patterns.  The problem is that
this would add significant overhead to checking for indexable clauses.
With "x LIKE 'foo%'" you just need to make a direct check whether x is
an indexed column; this is exactly parallel to noting whether x is indexed
in "x >= 'foo'", and it doesn't require much additional machinery or
cycles to reject the common case that there's no match to x.  But if you
want to notice whether d is indexed in "extract(year from d) = 2015",
that requires digging down another level in the expression, so it's going
to add overhead that's not there now, even in cases that have nothing to
do with extract() let alone have any chance of benefiting.

We might still be willing to do it if there were a sufficiently wide range
of examples that could be handled by the same extra machinery, but this
doesn't look too promising from that angle: AFAICS only the "year" case
could yield a useful index restriction.

So the short answer is that whether it's worth saving users from
hand-optimizing such cases depends a lot on what it's going to cost in
added planning time for queries that don't get any benefit.  This example
doesn't look like a case that's going to win that cost/benefit tradeoff
comparison.

            regards, tom lane


Re: extract(year from date) doesn't use index but maybe could?

From
Yves Dorfsman
Date:
On 2015-04-19 15:33, Tom Lane wrote:
>
>> It's possible that in the future someone will implement an optimization
>> like this, but I'm not aware of anyone working on that and I wouldn't
>> hold my breath.
>
> Yeah.  In principle you could make the planner do this.  As Adam Williams
> notes nearby, there's a problem with lack of exact consistency between
> extract() semantics and straight timestamp comparisons; but you could
> handle that by extracting indexable expressions that are considered lossy,

What about functions that are simpler such as upper()/lower()?

On 9.3, this:
`select email from users where lower(first_name) = 'yves';

is not using the index on first_name (Seq Scan on first_name). This should be
easy to implement?

--
http://yves.zioup.com
gpg: 4096R/32B0F416



Re: extract(year from date) doesn't use index but maybe could?

From
Tom Lane
Date:
Yves Dorfsman <yves@zioup.com> writes:
> What about functions that are simpler such as upper()/lower()?

If you think those are simpler, you're much mistaken :-(.  For instance,
"lower(first_name) = 'yves'" would have to be translated to something
like "first_name IN ('yves', 'yveS', 'yvEs', 'yvES', ..., 'YVES')"
-- 16 possibilities altogether, or 2^N for an N-character string.
(And that's just assuming ASCII up/down-casing, never mind the interesting
rules in some non-English languages.)  In a case-sensitive index, those
various strings aren't going to sort consecutively, so we'd end up needing
a separate index probe for each possibility.

extract(year from date) agrees with timestamp comparison up to boundary
cases, that is a few hours either way at a year boundary depending on the
timezone situation.  So you could translate it to a lossy-but-indexable
timestamp comparison condition and not expect to scan too many index items
that don't satisfy the original extract() condition.  But I don't see how
to make something like that work for mapping case-insensitive searches
onto case-sensitive indexes.

            regards, tom lane


Re: extract(year from date) doesn't use index but maybe could?

From
Gavin Flower
Date:
On 20/04/15 10:29, Tom Lane wrote:
> Yves Dorfsman <yves@zioup.com> writes:
>> What about functions that are simpler such as upper()/lower()?
> If you think those are simpler, you're much mistaken :-(.  For instance,
> "lower(first_name) = 'yves'" would have to be translated to something
> like "first_name IN ('yves', 'yveS', 'yvEs', 'yvES', ..., 'YVES')"
> -- 16 possibilities altogether, or 2^N for an N-character string.
> (And that's just assuming ASCII up/down-casing, never mind the interesting
> rules in some non-English languages.)  In a case-sensitive index, those
> various strings aren't going to sort consecutively, so we'd end up needing
> a separate index probe for each possibility.
>
> extract(year from date) agrees with timestamp comparison up to boundary
> cases, that is a few hours either way at a year boundary depending on the
> timezone situation.  So you could translate it to a lossy-but-indexable
> timestamp comparison condition and not expect to scan too many index items
> that don't satisfy the original extract() condition.  But I don't see how
> to make something like that work for mapping case-insensitive searches
> onto case-sensitive indexes.
>
>             regards, tom lane
>
>
Yeah, an event that happened at 2 am Thursday January 1st 2015 in New
Zealand, will be in the year 2014 for people of London in England!


Cheers,
Gavin


Re: extract(year from date) doesn't use index but maybe could?

From
Jim Nasby
Date:
On 4/19/15 4:33 PM, Tom Lane wrote:
> We might still be willing to do it if there were a sufficiently wide range
> of examples that could be handled by the same extra machinery, but this
> doesn't look too promising from that angle: AFAICS only the "year" case
> could yield a useful index restriction.

"date_trunc() op val" is where we'd want to do this. Midnight, month and
quarter boundaries are the cases I've commonly seen. The first is
especially bad, because people got in the habit of doing
timestamptz::date = '2015-1-1' and then got very irritated when that
cast became volatile.

Perhaps this could be better handled by having a way to translate
date/time specifications into a range? So something like
"date_trunc('...', timestamptz) = val" would become "date_trunk('...',
timestamptz) <@ [val, val+<correct interval based on truncation>)".
<,<=,>= and > wouldn't even need to be a range. To cover the broadest
base, we'd want to recognize that extract(year) could transform to
date_trunk('year',...), and timestamp/timestamptz::date transforms to
date_trunc('day',...). I think these transforms would be lossless, so
they could always be made, though we'd not want to do the transformation
if there was already an index on something like date_trunc(...).

I don't readily see any other data types where this would be useful, but
this is so common with timestamps that I think it's worth handling just
that case.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com