Re: query by partial timestamp - Mailing list pgsql-general

From Tom Lane
Subject Re: query by partial timestamp
Date
Msg-id 6304.1357692512@sss.pgh.pa.us
Whole thread Raw
In response to Re: query by partial timestamp  (Gavan Schneider <pg-gts@snkmail.com>)
Responses Re: query by partial timestamp  (Kirk Wythers <wythe001@umn.edu>)
List pgsql-general
Gavan Schneider <pg-gts@snkmail.com> writes:
> From my perspective there are at least three ways to attack
> this problem:

> (I have not tested these, so apologies for the stupid syntax errors.)

> 1.  SELECT ... WHERE 2011 = extract(YEAR FROM col_of_type_timestamp);

> 2.  SELECT ... WHERE
>          '2011-01-01'::TIMESTAMP <= col_of_type_timestamp
>      AND                            col_of_type_timestamp <= '2011-12-31'::TIMESTAMP;

> 3.  SELECT ... WHERE
>          (col_of_type_timestamp, col_of_type_timestamp) OVERLAPS
>          (DATE '2011-01-01', DATE '2012-01-01');

> Is this the full list?

Another possibility is date_trunc, viz

4. SELECT ... WHERE date_trunc(col_of_type_timestamp, 'year') = '2011-01-01'

You could also use BETWEEN, but that's just syntactic sugar for method 2.

Note that in both methods 2 and 3 it's easy to get the edge cases wrong;
in particular I think your version of method 2 gives the wrong answer
for later-than-midnight times on 2011-12-31, while #3 might (not sure)
give the wrong answer for exactly midnight on 2012-01-01.  These things
are fixable of course with a bit of care.  Personally I'd go with

    col >= '2011-01-01' AND col < '2012-01-01'

> So... generalizing the original question: which approach would
> yield the best performance and/or compliance with SQL standards?

> I note Steve Crawford has (strongly) hinted that direct date
> comparison is more likely to use an index (when available) so I
> suspect this is the way to go, but would an index based on
> extract(YEAR...) negate this difference?

Method 3 is not indexable at all and is unlikely to become so --- the
SQL standard's definition of OVERLAPS is squirrely enough that people
haven't bothered to think about optimizing it.  Method 2 works well with
a plain btree index on the timestamp column.  You can get method 1 to be
indexed if you create a functional index on "extract(year from col)";
but since the index would have pretty much no other use than answering
this exact type of query, that's not a very attractive alternative.
Method 4 is like method 1 --- you'd need a specialized index.

Note that in any case an index is not going to be helpful if the query
would need to fetch more than a few percent of the table.  The OP didn't
suggest how many years his data covers, but it's quite possible that
pulling a full year's worth of data will read enough of the table that
there's no point in worrying about whether an index could be used
anyway.

Another thing to think about is whether you'll have related sorts of
queries that aren't about full years --- maybe sometimes you need a
month's worth of data, for example.  The BETWEEN-style query and a btree
index will adapt easily to non-year intervals, while the EXTRACT
approach will not, and date_trunc is rather limited as well.

            regards, tom lane


pgsql-general by date:

Previous
From: Gavan Schneider
Date:
Subject: Re: query by partial timestamp
Next
From: Jerry Sievers
Date:
Subject: Re: Discerning when functions had execute revoked from public