Re: Postgresql 14 partitioning advice - Mailing list pgsql-performance

From: Slava Mudry
Subject: Re: Postgresql 14 partitioning advice
Date:
Msg-id: CAEFxPe29wJo6SVfpAfdfZf12g1yC9=jY2fp0Z_RPT4Y6TpyEtw@mail.gmail.com
In response to: Re: Postgresql 14 partitioning advice (Rick Otten <rottenwindfish@gmail.com>)
Responses: Re: Postgresql 14 partitioning advice (Justin Pryzby <pryzby@telsasoft.com>)
List: pgsql-performance
> I'm spinning up a new Postgresql 14 database where I'll have to store a couple years worth of time series data at the rate of single-digit millions of rows per day.
I think you absolutely need to use partitioning for the following reasons:
1. maintenance and roll-off of older data
2. indexes are much smaller
3. performance is predictable (if partition pruning kicks in)

Postgres 14 improved partitioning quite a bit. I used it in Postgres 9, and there was a lot of locking on the partition hierarchy when adding or dropping partition tables.
Having thousands of partitions shouldn't be a problem, BUT you will incur a cost at query planning time, which is usually under 0.1 seconds on modern hardware.
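
To make the roll-off and pruning points concrete, here is a minimal sketch of what that can look like on Postgres 14. The `metrics` table, its columns, and the partition names are purely illustrative:
```
-- Hypothetical time-series table, range-partitioned by day.
CREATE TABLE metrics (
    recorded_at timestamptz NOT NULL,
    device_id   bigint      NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE metrics_2022_08_01 PARTITION OF metrics
    FOR VALUES FROM ('2022-08-01') TO ('2022-08-02');

-- An index on the parent is created on each partition, so each index stays small.
CREATE INDEX ON metrics (device_id, recorded_at);

-- Roll-off of old data: Postgres 14 can detach a partition without holding
-- an ACCESS EXCLUSIVE lock on the parent for the whole operation.
ALTER TABLE metrics DETACH PARTITION metrics_2022_08_01 CONCURRENTLY;
DROP TABLE metrics_2022_08_01;

-- Pruning: a constant predicate on the partition key lets the planner skip
-- every partition outside the requested range.
EXPLAIN SELECT * FROM metrics
WHERE recorded_at >= '2022-08-01' AND recorded_at < '2022-08-02';
```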


On Tue, Aug 2, 2022 at 5:55 AM Rick Otten <rottenwindfish@gmail.com> wrote:


On Mon, Aug 1, 2022 at 10:16 AM Rick Otten <rottenwindfish@gmail.com> wrote:

The other problem I ran into, which I'm still building a test case for (and which I fear might be a bug if I can reproduce it easily), is that when I did the original select in a CTE and then sorted outside of the CTE, the database still spent a _ton_ of time sorting even though the CTE found 0 rows:
```
   ->  Sort  (cost=70.03..72.53 rows=1000 width=112) (actual time=84848.452..84848.453 rows=0 loops=1)
```
Once I can reproduce this on test data I'll be able to pin down more closely what is happening and tell whether I'm just reading the explain plan wrong or whether something is broken. It was getting mixed up with the lack-of-pruning/index-usage problem.

I'll report back again next week. Anyway, it looks to me like it doesn't really matter much (within reason), from a performance perspective, how many partitions we use for our data set and query patterns. We should be able to pick whatever is most convenient from an archiving and data management perspective instead.


This behavior is definitely consistent: the 0-row case ends up slower than when I find some rows in my CTE:
```
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual time=87110.841..87110.842 rows=0 loops=1)
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual time=25367.867..25367.930 rows=840 loops=1)
```
The only thing I changed in the query was the date range. It is actually the CTE scan step inside the Sort node that is slower when no rows are returned than when rows are returned. It also only happens when all of the partitions are sequentially scanned instead of being pruned.

I'm still writing up a test case that can demo this without using proprietary data.

After a bunch of experiments I can explain this now.  :-)

I had a `limit` clause in my test CTE. When sequentially scanning a bunch of partitions, if the limit is reached, the remaining partitions are marked `never executed` and are not scanned. On the other hand, when no rows are found, all of the partitions have to be scanned.

Therefore, with many millions of rows in the partitions, and with the planner forced into sequential scans because I put the `at time zone` clause in the `where`, the case where rows are found is always noticeably faster than the case where no rows are found, as long as at least one partition hasn't been scanned yet when the limit is hit.
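
For anyone who wants to see the shape of the query, a minimal sketch along these lines (the `metrics` table and `recorded_at` column are invented names standing in for the proprietary data) should show the same effect, assuming a time-series table with many large partitions:
```
-- Applying "at time zone" to the partition key means the planner cannot
-- prune, so every partition is a candidate for a sequential scan.
WITH recent AS (
    SELECT *
    FROM metrics
    WHERE recorded_at AT TIME ZONE 'UTC' >= '2022-08-01'
      AND recorded_at AT TIME ZONE 'UTC' <  '2022-08-02'
    LIMIT 1000
)
SELECT * FROM recent ORDER BY recorded_at;
```
When the date range matches rows, the `LIMIT` is satisfied after scanning only a few partitions and the rest show up as `never executed` in `EXPLAIN ANALYZE`; when it matches nothing, every partition has to be scanned before the CTE can return, which is why the 0-row case is slower.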

I'm now satisfied this is a good thing, and will move on to other problems.  Thanks for hearing me out.  I was scratching my head for a while over that one.

 


--
-slava
