Thread: checking for a NULL date in a partitioned table kills performance

All;

I am running a select from a partitioned table. The table (and all the 
partitions) have an index on contract_date like this:
CREATE INDEX on part_tab (contract_date) where contract_date > 
'2022-01-01'::date

The table (including all partitions) has 32million rows
The db server is an aurora postgresql instance with 128GB of ram and 16 
vcpu's

The shared buffers is set to 90GB and effective_cache_size is also 90GB
I set default_statistics_target to 1000 and ram a vacuum analyze on the 
table

I am selecting a number of columns and specifying this where clause:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

This takes 15 seconds to run and an explain says it's doing a table scan 
on all partitions (the query is not specifying the partition key)
If I change the where clause to look like this:

WHERE (
                   (contract_date > '2022-01-01'::date)
              )

Then it performs index scans on all the partitions and runs in about 600ms

If i leave the where clause off entirely it performs table scans of the 
partitions and takes approx 18 seconds to run

I am trying to get the performance to less than 2sec,
I have tried adding indexes on the table and all partitions like this:
CREATE INDEX ON table (contract_date NULLS FIRST) ;
but the performance with the full where clause is the same:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

runs in 15 seconds and scans all partitions

I also tried indexes i=on the table and all partitions like this:
CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;

but I get the same result, table scans on all partitions and it runs in 
15 seconds

Any help or advice ?

Thanks in advance




> On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:
>
> All;
>
> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date
likethis: 
> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>
> The table (including all partitions) has 32million rows
> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>
> The shared buffers is set to 90GB and effective_cache_size is also 90GB
> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>
> I am selecting a number of columns and specifying this where clause:
>
> WHERE (
>                     (contract_date IS NULL)
>                     OR
>                     (contract_date > '2022-01-01'::date)
>              )
>
> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not
specifyingthe partition key) 
> If I change the where clause to look like this:
>
> WHERE (
>                   (contract_date > '2022-01-01'::date)
>              )
>
> Then it performs index scans on all the partitions and runs in about 600ms
>
> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run
>
> I am trying to get the performance to less than 2sec,
> I have tried adding indexes on the table and all partitions like this:
> CREATE INDEX ON table (contract_date NULLS FIRST) ;
> but the performance with the full where clause is the same:
>
> WHERE (
>                     (contract_date IS NULL)
>                     OR
>                     (contract_date > '2022-01-01'::date)
>              )
>
> runs in 15 seconds and scans all partitions
>
> I also tried indexes i=on the table and all partitions like this:
> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>
> but I get the same result, table scans on all partitions and it runs in 15 seconds
>
> Any help or advice ?
>
> Thanks in advance
>
>

What is contract_date and when will it be null?


On 8/22/24 5:06 PM, Rui DeSousa wrote:
>
>> On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:
>>
>> All;
>>
>> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date
likethis:
 
>> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>>
>> The table (including all partitions) has 32million rows
>> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>>
>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>>
>> I am selecting a number of columns and specifying this where clause:
>>
>> WHERE (
>>                      (contract_date IS NULL)
>>                      OR
>>                      (contract_date > '2022-01-01'::date)
>>               )
>>
>> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not
specifyingthe partition key)
 
>> If I change the where clause to look like this:
>>
>> WHERE (
>>                    (contract_date > '2022-01-01'::date)
>>               )
>>
>> Then it performs index scans on all the partitions and runs in about 600ms
>>
>> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to
run
>>
>> I am trying to get the performance to less than 2sec,
>> I have tried adding indexes on the table and all partitions like this:
>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>> but the performance with the full where clause is the same:
>>
>> WHERE (
>>                      (contract_date IS NULL)
>>                      OR
>>                      (contract_date > '2022-01-01'::date)
>>               )
>>
>> runs in 15 seconds and scans all partitions
>>
>> I also tried indexes i=on the table and all partitions like this:
>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>
>> but I get the same result, table scans on all partitions and it runs in 15 seconds
>>
>> Any help or advice ?
>>
>> Thanks in advance
>>
>>
> What is contract_date and when will it be null?


it's a date data type and it allows NULL's not sure why, this is a 
client's system





On 8/22/24 5:26 PM, Sbob wrote:
>
> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>
>>> On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:
>>>
>>> All;
>>>
>>> I am running a select from a partitioned table. The table (and all 
>>> the partitions) have an index on contract_date like this:
>>> CREATE INDEX on part_tab (contract_date) where contract_date > 
>>> '2022-01-01'::date
>>>
>>> The table (including all partitions) has 32million rows
>>> The db server is an aurora postgresql instance with 128GB of ram and 
>>> 16 vcpu's
>>>
>>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>>> I set default_statistics_target to 1000 and ram a vacuum analyze on 
>>> the table
>>>
>>> I am selecting a number of columns and specifying this where clause:
>>>
>>> WHERE (
>>>                      (contract_date IS NULL)
>>>                      OR
>>>                      (contract_date > '2022-01-01'::date)
>>>               )
>>>
>>> This takes 15 seconds to run and an explain says it's doing a table 
>>> scan on all partitions (the query is not specifying the partition key)
>>> If I change the where clause to look like this:
>>>
>>> WHERE (
>>>                    (contract_date > '2022-01-01'::date)
>>>               )
>>>
>>> Then it performs index scans on all the partitions and runs in about 
>>> 600ms
>>>
>>> If i leave the where clause off entirely it performs table scans of 
>>> the partitions and takes approx 18 seconds to run
>>>
>>> I am trying to get the performance to less than 2sec,
>>> I have tried adding indexes on the table and all partitions like this:
>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>> but the performance with the full where clause is the same:
>>>
>>> WHERE (
>>>                      (contract_date IS NULL)
>>>                      OR
>>>                      (contract_date > '2022-01-01'::date)
>>>               )
>>>
>>> runs in 15 seconds and scans all partitions
>>>
>>> I also tried indexes i=on the table and all partitions like this:
>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>
>>> but I get the same result, table scans on all partitions and it runs 
>>> in 15 seconds
>>>
>>> Any help or advice ?
>>>
>>> Thanks in advance
>>>
>>>
>> What is contract_date and when will it be null?
>
>
> it's a date data type and it allows NULL's not sure why, this is a 
> client's system
>
>
29 million of the 32 million rows in the table have NULL for contract_date







Re: checking for a NULL date in a partitioned table kills performance

From
"David G. Johnston"
Date:
On Thu, Aug 22, 2024 at 4:32 PM Sbob <sbob@quadratum-braccas.com> wrote:

29 million of the 32 million rows in the table have NULL for contract_date


Your expectation that this query should use an index is flawed.  Indexes are for highly selective queries.  Finding nulls on that table is not selective.

David J.



On Aug 22, 2024, at 7:32 PM, Sbob <sbob@quadratum-braccas.com> wrote:


On 8/22/24 5:26 PM, Sbob wrote:

On 8/22/24 5:06 PM, Rui DeSousa wrote:

On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:

All;

I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date

The table (including all partitions) has 32million rows
The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's

The shared buffers is set to 90GB and effective_cache_size is also 90GB
I set default_statistics_target to 1000 and ram a vacuum analyze on the table

I am selecting a number of columns and specifying this where clause:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
If I change the where clause to look like this:

WHERE (
                   (contract_date > '2022-01-01'::date)
              )

Then it performs index scans on all the partitions and runs in about 600ms

If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run

I am trying to get the performance to less than 2sec,
I have tried adding indexes on the table and all partitions like this:
CREATE INDEX ON table (contract_date NULLS FIRST) ;
but the performance with the full where clause is the same:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

runs in 15 seconds and scans all partitions

I also tried indexes i=on the table and all partitions like this:
CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;

but I get the same result, table scans on all partitions and it runs in 15 seconds

Any help or advice ?

Thanks in advance


What is contract_date and when will it be null?


it's a date data type and it allows NULL's not sure why, this is a client's system


29 million of the 32 million rows in the table have NULL for contract_date


NULLs are not indexed thus the OR predicate invalidate the use of the index.  

Since you are already creating a partial index just include the NULLs.  It index will get used for both of your queries.

create index table_idx1 
  on table (contract_date) 
  where contract_date > ‘1/1/2022’
      or contract_date is null
;


The reason why I asked when is contract_date null is because attributes in a table should be non nullable.  If it’s nullable then that begs the question if it belong in that table in the first place; and sometimes the answer is yes.  I just see a lot of half baked schemas out there.  I refer to them as organically designed schemas. 

-Rui.



Re: checking for a NULL date in a partitioned table kills performance

From
Muhammad Usman Khan
Date:

Hi Sbob,
Have you tried using the following indexes ?

B-tree
    • The default and most commonly used type, ideal for equality, range and Pattern queries.

BRIN (Block Range INdex)
    • Compact indexes that are efficient for large tables where the data is naturally ordered.



On Fri, 23 Aug 2024 at 02:45, Sbob <sbob@quadratum-braccas.com> wrote:
All;

I am running a select from a partitioned table. The table (and all the
partitions) have an index on contract_date like this:
CREATE INDEX on part_tab (contract_date) where contract_date >
'2022-01-01'::date

The table (including all partitions) has 32million rows
The db server is an aurora postgresql instance with 128GB of ram and 16
vcpu's

The shared buffers is set to 90GB and effective_cache_size is also 90GB
I set default_statistics_target to 1000 and ram a vacuum analyze on the
table

I am selecting a number of columns and specifying this where clause:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

This takes 15 seconds to run and an explain says it's doing a table scan
on all partitions (the query is not specifying the partition key)
If I change the where clause to look like this:

WHERE (
                   (contract_date > '2022-01-01'::date)
              )

Then it performs index scans on all the partitions and runs in about 600ms

If i leave the where clause off entirely it performs table scans of the
partitions and takes approx 18 seconds to run

I am trying to get the performance to less than 2sec,
I have tried adding indexes on the table and all partitions like this:
CREATE INDEX ON table (contract_date NULLS FIRST) ;
but the performance with the full where clause is the same:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

runs in 15 seconds and scans all partitions

I also tried indexes i=on the table and all partitions like this:
CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;

but I get the same result, table scans on all partitions and it runs in
15 seconds

Any help or advice ?

Thanks in advance


Re: checking for a NULL date in a partitioned table kills performance

From
shammat@gmx.net
Date:
Sbob schrieb am 22.08.2024 um 23:44:
> 
> I am selecting a number of columns and specifying this where clause:
> 
> WHERE (
>                     (contract_date IS NULL)
>                     OR
>                     (contract_date > '2022-01-01'::date)
>              )
> 

It's not the check for NULL, it's the OR that makes this perform so badly. 

I typically never set columns used for range queries to NULL. 

Would using infinity instead of NULL be a viable option here? 

Then you can remove the OR condition entirely. 



Re: checking for a NULL date in a partitioned table kills performance

From
Doug Reynolds
Date:
I don’t see how an index is going to help since virtually all of the rows are null AND contract_date isn’t the
partitionkey.   

Perhaps, you could try a UNION ALL with one query selecting the date and the other selecting where the date is null.

You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the
COALESCEin the query. 


Sent from my iPhone

> On Aug 22, 2024, at 7:43 PM, Sbob <sbob@quadratum-braccas.com> wrote:
>
> 
>> On 8/22/24 5:26 PM, Sbob wrote:
>>
>>> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>>
>>>> On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:
>>>>
>>>> All;
>>>>
>>>> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date
likethis: 
>>>> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>>>>
>>>> The table (including all partitions) has 32million rows
>>>> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>>>>
>>>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>>>> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>>>>
>>>> I am selecting a number of columns and specifying this where clause:
>>>>
>>>> WHERE (
>>>>                      (contract_date IS NULL)
>>>>                      OR
>>>>                      (contract_date > '2022-01-01'::date)
>>>>               )
>>>>
>>>> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not
specifyingthe partition key) 
>>>> If I change the where clause to look like this:
>>>>
>>>> WHERE (
>>>>                    (contract_date > '2022-01-01'::date)
>>>>               )
>>>>
>>>> Then it performs index scans on all the partitions and runs in about 600ms
>>>>
>>>> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to
run
>>>>
>>>> I am trying to get the performance to less than 2sec,
>>>> I have tried adding indexes on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>>> but the performance with the full where clause is the same:
>>>>
>>>> WHERE (
>>>>                      (contract_date IS NULL)
>>>>                      OR
>>>>                      (contract_date > '2022-01-01'::date)
>>>>               )
>>>>
>>>> runs in 15 seconds and scans all partitions
>>>>
>>>> I also tried indexes i=on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>>
>>>> but I get the same result, table scans on all partitions and it runs in 15 seconds
>>>>
>>>> Any help or advice ?
>>>>
>>>> Thanks in advance
>>>>
>>>>
>>> What is contract_date and when will it be null?
>>
>>
>> it's a date data type and it allows NULL's not sure why, this is a client's system
>>
>>
> 29 million of the 32 million rows in the table have NULL for contract_date
>
>
>
>
>
>





On 8/22/24 6:07 PM, Rui DeSousa wrote:


On Aug 22, 2024, at 7:32 PM, Sbob <sbob@quadratum-braccas.com> wrote:


On 8/22/24 5:26 PM, Sbob wrote:

On 8/22/24 5:06 PM, Rui DeSousa wrote:

On Aug 22, 2024, at 5:44 PM, Sbob <sbob@quadratum-braccas.com> wrote:

All;

I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date

The table (including all partitions) has 32million rows
The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's

The shared buffers is set to 90GB and effective_cache_size is also 90GB
I set default_statistics_target to 1000 and ram a vacuum analyze on the table

I am selecting a number of columns and specifying this where clause:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
If I change the where clause to look like this:

WHERE (
                   (contract_date > '2022-01-01'::date)
              )

Then it performs index scans on all the partitions and runs in about 600ms

If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run

I am trying to get the performance to less than 2sec,
I have tried adding indexes on the table and all partitions like this:
CREATE INDEX ON table (contract_date NULLS FIRST) ;
but the performance with the full where clause is the same:

WHERE (
                     (contract_date IS NULL)
                     OR
                     (contract_date > '2022-01-01'::date)
              )

runs in 15 seconds and scans all partitions

I also tried indexes i=on the table and all partitions like this:
CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;

but I get the same result, table scans on all partitions and it runs in 15 seconds

Any help or advice ?

Thanks in advance


What is contract_date and when will it be null?


it's a date data type and it allows NULL's not sure why, this is a client's system


29 million of the 32 million rows in the table have NULL for contract_date


NULLs are not indexed thus the OR predicate invalidate the use of the index.  

Since you are already creating a partial index just include the NULLs.  It index will get used for both of your queries.

create index table_idx1 
  on table (contract_date) 
  where contract_date > ‘1/1/2022’
      or contract_date is null
;


The reason why I asked when is contract_date null is because attributes in a table should be non nullable.  If it’s nullable then that begs the question if it belong in that table in the first place; and sometimes the answer is yes.  I just see a lot of half baked schemas out there.  I refer to them as organically designed schemas. 

-Rui.

I agree, I will find out from the client


> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <mav@wastegate.net> wrote:
>
> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the
COALESCEin the query. 

I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway
I'msure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be
requiredsince most rows have it null, and it's not even related to the partition key. 


Re: checking for a NULL date in a partitioned table kills performance

From
Doug Reynolds
Date:
The only difference is that you would be reading from one index instead of two, which could be more efficient.


Sent from my iPhone

> On Aug 23, 2024, at 11:19 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
>
> 
>>
>> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <mav@wastegate.net> wrote:
>>
>> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the
COALESCEin the query. 
>
> I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway
I'msure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be
requiredsince most rows have it null, and it's not even related to the partition key. 
>




Re: checking for a NULL date in a partitioned table kills performance

From
"Wetmore, Matthew (CTR)"
Date:

I have had this issue in the past.

 

The real admin fix to this is to have a NULL replacement character that prevents this.

This does a few things:

  1. An index will index on a replacement character ( I use <->)
  2. A join is easier on a replacement character than NULL (Nulls last/first avoided)
  3. Stops all evil NULL rules.

 

We strive to fix things, but the real solution, IMHO, is better arch design and better fundamental understanding of how NULL works.

 

Partitioned tables under 500M-750M rows will always have these performance issues.

 

Great ideas on the workaround’s though, I do understand sometimes you inherit a bad db.

 

 

 

From: Scott Ribe <scott_ribe@elevated-dev.com>
Date: Friday, August 23, 2024 at 8:17 AM
To: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Subject: Re: checking for a NULL date in a partitioned table kills performance

 

> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <mav@wastegate.net> wrote: > > You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query. I don't see

> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <mav@wastegate.net> wrote:
> 
> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query.
 
I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway I'm sure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be required since most rows have it null, and it's not even related to the partition key.
 
> On Aug 23, 2024, at 9:42 AM, Doug Reynolds <mav@wastegate.net> wrote:
>
> The only difference is that you would be reading from one index instead of two, which could be more efficient.

Ah yes, that's a good point to take into consideration in such a case.

In the one at hand though, if statistics are correct, neither index is going to be used, given the 90% of rows with
NULLvalues. Using an index would just waste time compared to a simple sequential scan.