extract(year from date) doesn't use index but maybe could? - Mailing list pgsql-performance

From Jon Dufresne
Subject extract(year from date) doesn't use index but maybe could?
Date
Msg-id CADhq2b51G0QrjNz_vH6e_SLyu=tR0A0G0zHhFsP7rqRr+6FA3w@mail.gmail.com
Whole thread Raw
Responses Re: extract(year from date) doesn't use index but maybe could?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-performance
Given the table:

CREATE TABLE dates (id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)

With an *index* on field d. The following two queries are functionally
equivalent:

1. SELECT * FROM dates WHERE d >= '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'

By functionally equivalent, they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.

Some database abstraction layers have attempted to workaround this
limitation by rewriting EXTRACT(year ...) queries into a query more
like query 1. For example: Django's ORM does exctly this. Rather than
all abstraction layers trying to optimize this case, maybe it could be
pushed to the database layer.

I have written a test script that demonstrates that these functionally
equivalent queries have different performance characteristics. The
script and results are provide below:

RESULTS:

----
EXPLAIN SELECT * FROM dates WHERE d >= '1900-01-01'
                                 QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on dates  (cost=9819.23..26390.15 rows=524233 width=40)
   Recheck Cond: (d >= '1900-01-01'::date)
   ->  Bitmap Index Scan on d_idx  (cost=0.00..9688.17 rows=524233 width=0)
         Index Cond: (d >= '1900-01-01'::date)
(4 rows)

EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on dates  (cost=0.00..37540.25 rows=524233 width=40)
   Filter: (date_part('year'::text, (d)::timestamp without time zone)
>= 1900::double precision)
(2 rows)

Timing
select_without_extract: 284.233350s
select_with_extract: 323.106491s
----

SCRIPT:

----
#!/usr/bin/python3

import datetime
import subprocess
import random
import timeit
import sys


subprocess.check_call(['psql', 'postgres', '-c', 'DROP DATABASE IF
EXISTS datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'postgres', '-c', 'CREATE DATABASE
datetest'], stdout=subprocess.DEVNULL)
subprocess.check_call(['psql', 'datetest', '-c', 'CREATE TABLE dates
(id SERIAL, d DATE NOT NULL, t TEXT NOT NULL)'],
stdout=subprocess.DEVNULL)


def chunks(n, l):
    i = 0
    while i < len(l):
        yield l[i:i+n]
        i += n

d = datetime.date(1800, 1, 1)
today = datetime.date.today()
values = []
while d < today:
    values.extend('(\'%s\', \'%s\')' % (d, d) for i in range(20))
    d += datetime.timedelta(days=1)
random.shuffle(values)
for chunk in chunks(1000, values):
    s = ','.join(chunk)
    subprocess.check_call(['psql', 'datetest', '-c', 'INSERT INTO
dates (d, t) VALUES %s' % s], stdout=subprocess.DEVNULL)


subprocess.check_call(['psql', 'datetest', '-c', 'CREATE INDEX d_idx
ON dates (d)'], stdout=subprocess.DEVNULL)
print('EXPLAIN SELECT * FROM dates WHERE d >= \'1900-01-01\'')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE d >= \'1900-01-01\''])
print('EXPLAIN SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900')
sys.stdout.flush()
subprocess.check_call(['psql', 'datetest', '-c', 'EXPLAIN SELECT *
FROM dates WHERE EXTRACT(year from d) >= 1900'])


def select_without_extract():
    subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE d >= \'1900-01-01\''], stdout=subprocess.DEVNULL)

def select_with_extract():
    subprocess.check_call(['psql', 'datetest', '-c', 'SELECT * FROM
dates WHERE EXTRACT(year from d) >= 1900'], stdout=subprocess.DEVNULL)

print('Timing')
sys.stdout.flush()

v = timeit.timeit('select_without_extract()', setup='from __main__
import select_without_extract', number=100)
print('select_without_extract: %fs' % v)
sys.stdout.flush()

v = timeit.timeit('select_with_extract()', setup='from __main__ import
select_with_extract', number=100)
print('select_with_extract: %fs' % v)
sys.stdout.flush()
---


pgsql-performance by date:

Previous
From: Andreas Joseph Krogh
Date:
Subject: Performance of vacuumlo
Next
From: Tomas Vondra
Date:
Subject: Re: extract(year from date) doesn't use index but maybe could?