Thread: Possible to improve optimisation / index usage based on domain properties of a function

Hi all,

Take the following scenario

I have a set of partitions inherited from a parent table, called streams.
One of the properties of these tables is a timestamp field, nothing fancy about it.

I also have a qualified index on this field.


I’ve noticed that if I perform the following query, the planner will correctly use the CHECK constraints to determine the partition, and then use the indexes available to retrieve the streams between the specified dates.


select count(*) from streams where stream_date >= ‘2013-01-08’ and stream_date < ‘2013-01-09’;


If however, I was to provide the below query, it uses a sequential scan based plan.  The planner is unable to utilise any indexes because it can’t know what the function is going to return – thus unable to constrain the range at the time of planning the execution.

select count(*) from streams where date(stream_date) = ‘2013-01-08’;



I’m wondering if we could build into postgres some level of metadata regarding the properties of a function, such that the optimiser could filter against the range of values that the function is expected to return.

In this case, it could deduce that the date function will only ever return a value for stream_date to within a certain maximum and minimum range.
Thus the planner could scan the index for all values of stream_date falling within +/- 24 hours of the right operand, and then check/re-check the results.


I suspect this would only be suitable for very basic functions, such as date(), date_trunc() - I suspect, for any function that reduces cardinality to any predictable degree.

Thoughts?

Tim



Hi all,

Take the following scenario

I have a set of partitions inherited from a parent table, called streams.
One of the properties of these tables is a timestamp field, nothing fancy about it.

I also have a qualified index on this field.


I’ve noticed that if I perform the following query, the planner will correctly use the CHECK constraints to determine the partition, and then use the indexes available to retrieve the streams between the specified dates.


select count(*) from streams where stream_date >= ‘2013-01-08’ and stream_date < ‘2013-01-09’;

This is correct way of writing such queries.
 

If however, I was to provide the below query, it uses a sequential scan based plan.  The planner is unable to utilise any indexes because it can’t know what the function is going to return – thus unable to constrain the range at the time of planning the execution.

select count(*) from streams where date(stream_date) = ‘2013-01-08’;


This will not use index on stream_date. Nothing unusual in this. Nothing special with PostgreSQL. None of the RDBMS I have dealt with are smart enough to transform this query.



I’m wondering if we could build into postgres some level of metadata regarding the properties of a function, such that the optimiser could filter against the range of values that the function is expected to return.

In this case, it could deduce that the date function will only ever return a value for stream_date to within a certain maximum and minimum range.
Thus the planner could scan the index for all values of stream_date falling within +/- 24 hours of the right operand, and then check/re-check the results.

If you can't go for the smarter query, go for more optimum index by "expression based index"

 

I suspect this would only be suitable for very basic functions, such as date(), date_trunc() - I suspect, for any function that reduces cardinality to any predictable degree.


Wrong, there are many expressions which won't use indexes but the moment you shift the calculation from LHS to RHS, indexes will appear in plan. This I have seen with at least 3 other RDBMS.

 
Thoughts?


I don't think the RDBMS optimizer should be overloaded with smartness which is expected from users writing it. If you do it, then there is no end to it.

 
Tim



On 20 Feb 2014, at 7:21, Sameer Kumar <sameer.kumar@ashnik.com> wrote:

> If however, I was to provide the below query, it uses a sequential scan based plan.  The planner is unable to utilise
anyindexes because it can’t know what the function is going to return – thus unable to constrain the range at the time
ofplanning the execution. 
>
> select count(*) from streams where date(stream_date) = ‘2013-01-08’;

The inverse of that expression, if it’s possible to formulate one, would most likely use the index though:

select count(*) from streams where stream_date = inv_date(‘2013-01-08’);

>
>> I’m wondering if we could build into postgres some level of metadata regarding the properties of a function, such
thatthe optimiser could filter against the range of values that the function is expected to return. 
>>
>> In this case, it could deduce that the date function will only ever return a value for stream_date to within a
certainmaximum and minimum range. 
>> Thus the planner could scan the index for all values of stream_date falling within +/- 24 hours of the right
operand,and then check/re-check the results. 
>>
> If you can't go for the smarter query, go for more optimum index by "expression based index"
>
> http://www.postgresql.org/docs/9.1/static/indexes-expressional.html

AFAIK, you can’t use expression based indexes to partition a table, so that won’t help the OP much.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




On Thu, Feb 20, 2014 at 3:34 PM, Alban Hertroys <haramrae@gmail.com> wrote:
> If however, I was to provide the below query, it uses a sequential scan based plan.  The planner is unable to utilise any indexes because it can’t know what the function is going to return – thus unable to constrain the range at the time of planning the execution.
>
> select count(*) from streams where date(stream_date) = ‘2013-01-08’;

The inverse of that expression, if it’s possible to formulate one, would most likely use the index though:

select count(*) from streams where stream_date = inv_date(‘2013-01-08’);


He has already posted that:
select count(*) from streams where stream_date >= ‘2013-01-08’ and stream_date < ‘2013-01-09’;
This would use index

 
>
>> I’m wondering if we could build into postgres some level of metadata regarding the properties of a function, such that the optimiser could filter against the range of values that the function is expected to return.
>>
>> In this case, it could deduce that the date function will only ever return a value for stream_date to within a certain maximum and minimum range.
>> Thus the planner could scan the index for all values of stream_date falling within +/- 24 hours of the right operand, and then check/re-check the results.
>>
> If you can't go for the smarter query, go for more optimum index by "expression based index"
>
> http://www.postgresql.org/docs/9.1/static/indexes-expressional.html

AFAIK, you can’t use expression based indexes to partition a table, so that won’t help the OP much.

1. I think Tim is talking about index usage [an index which he has on stream_date] and not partition

2. Tim, the index usage or non-usage in your two queries would remain same even if you have a single huge table

3. I believe the partitioning tirgger/rule could re-direct records based on an expression too [e.g. when date(stream_date)=24-Jan-2014, send it to Partition_24Jan14]. I am not sure if query planner would go to a particular partition when we query based on date(stream_date). I need to test this.



Best Regards,

Sameer Kumar | Database Consultant

ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533

M: +65 8110 0350  T: +65 6438 3504 | www.ashnik.com

icons

 

Email patch

 

This email may contain confidential, privileged or copyright material and is solely for the use of the intended recipient(s).

Attachment

Thanks Alban, Sameer.

My use of partitions should have been more of a side note really.
I was particularly interested in wether the query planner could optimise a date_folded equality expression into a range query - for the case where it could benefit from an existing index on the non-folded values.

Granted, an expression based index would solve this.
It just seemed an opportunity to open up opportunities for the QEP – at least for the simple case.

Cheers,

TIm


From: Sameer Kumar <sameer.kumar@ashnik.com>
Date: Thursday, 20 February 2014 07:40
To: Alban Hertroys <haramrae@gmail.com>
Cc: Tim Kane <tim.kane@gmail.com>, pgsql-general General <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] Possible to improve optimisation / index usage based on domain properties of a function


On Thu, Feb 20, 2014 at 3:34 PM, Alban Hertroys <haramrae@gmail.com> wrote:
> If however, I was to provide the below query, it uses a sequential scan based plan.  The planner is unable to utilise any indexes because it can’t know what the function is going to return – thus unable to constrain the range at the time of planning the execution.
>
> select count(*) from streams where date(stream_date) = ‘2013-01-08’;

The inverse of that expression, if it’s possible to formulate one, would most likely use the index though:

select count(*) from streams where stream_date = inv_date(‘2013-01-08’);


He has already posted that:
select count(*) from streams where stream_date >= ‘2013-01-08’ and stream_date < ‘2013-01-09’;
This would use index

 
>
>> I’m wondering if we could build into postgres some level of metadata regarding the properties of a function, such that the optimiser could filter against the range of values that the function is expected to return.
>>
>> In this case, it could deduce that the date function will only ever return a value for stream_date to within a certain maximum and minimum range.
>> Thus the planner could scan the index for all values of stream_date falling within +/- 24 hours of the right operand, and then check/re-check the results.
>>
> If you can't go for the smarter query, go for more optimum index by "expression based index"
>
> http://www.postgresql.org/docs/9.1/static/indexes-expressional.html

AFAIK, you can’t use expression based indexes to partition a table, so that won’t help the OP much.

1. I think Tim is talking about index usage [an index which he has on stream_date] and not partition

2. Tim, the index usage or non-usage in your two queries would remain same even if you have a single huge table

3. I believe the partitioning tirgger/rule could re-direct records based on an expression too [e.g. when date(stream_date)=24-Jan-2014, send it to Partition_24Jan14]. I am not sure if query planner would go to a particular partition when we query based on date(stream_date). I need to test this.



Best Regards,

Sameer Kumar | Database Consultant

ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533

M: +65 8110 0350  T: +65 6438 3504 | www.ashnik.com

icons

 

Email patch

 

This email may contain confidential, privileged or copyright material and is solely for the use of the intended recipient(s).

Attachment