Thread: Occasional performance issue after changing table partitions

Occasional performance issue after changing table partitions

From

Nathan Ward

Date:

10 July 2022, 04:55:34

Hi,

I am running Postgres 13 on CentOS 7, installed from the yum.postgresql.org repo.

I have 3 servers in different sites, with asynchronous replication managed by pgpool 4.0.11. Highest latency is 16ms RTT - but shouldn’t matter as it’s async - but my application is running many concurrent connections to keep throughput up when latency is high - as the queries are typically very fast.

Servers have 32GB RAM and 4 cores.

max_connections = 256

max_locks_per_transaction = 256

max_stack_depth = 2MB

max_wal_senders = 5

max_wal_size = 1GB

min_wal_size = 80MB

shared_buffers = 10GB

I am storing internet usage data and attributes from RADIUS messages.

Our ingest does not come directly from the RADIUS server - instead it comes via rabbitmq which allows the database performance to drop without impacting our RADIUS services.

I am making pretty heavy use of partitions to manage this. The data usage accounting is stored in a partition per hour, which runs through daily aggregation - aggregating usage older than 3 to usage per day, rather than usage per hour. At this point, we have a partition per day, rather than per hour.

I am doing this, because we want hourly usage for 3 days, and RADIUS data comes in as a running total rather than a delta per some interval, and we want to have the deltas available. For each incoming record, we look at previous hours to find a previous total, and calculate a delta from that - this means we are regularly (hundred of times per second) looking at the last 2 hours. Partitioning per hour lets us manage that.

I partition on an “interval_start” timestamp column.

The aggregation process which runs each night at some time after midnight does the following, in a transaction:

1) Create a new “full day” partition for 3 days ago.

2) Look at hourly data older than 3 days (so, across 24 partitions) and calculate totals for that internet session on that day and insert in to the “full day” table.

3) Detach the hourly partitions

4) Attach the full day partition

5) Drop the hourly partitions

This process takes around 12s to run - aggregating around 16.8M rows in to 700k rows.

Each partition is around 70MB.

Through this process, data continues to be ingested - but as the data being aggregated is 3+ days old, and the ingest process is only looking at current partitions (i.e. last couple hours), we don’t have any conflicts here.

Additionally (and I think less importantly so skip this if it’s getting long winded) we store all RADIUS attributes from the RADIUS messages as jsonb, doing deduplication - most RADIUS messages (apart from the usage, session timers, etc.) are static, so we have a table of those “static” attributes in a jsonb column, and a table of events matching timestamps to sets of RADIUS messages.

These tables also make use of partitioning - we have hot, cold, and frozen data so that only the hot and cold indexes are in memory most of the time - the “frozen” data is very large (100s of GB) so is only used for reporting.

There are a couple of processes which run over this data periodically:

1) Move data from hot -> cold -> frozen as it ages

2) Aggregate “events” (i.e. RADIUS message timestamp to attributes) together so that longer term we don’t have a row per message, and rather only a row each time the attributes for a RADIUS session changes. This means there is always dead rows in this table, but they regularly get re-used.

This appears to work very well, these processes run every 5 mins or so. The event aggregation process drops around 180k rows.

The issue I am having, is that when the daily data usage aggregation runs, sometimes we have a big performance impact, with the following characteristics which happen *after* the aggregation job runs in it usual fast time of 12s or so:

- The aggregation runs fast as per normal

- Load on the server goes to 30-40 - recall we have quite high “max connections” to keep throughput high when the client is far (16ms) from the server

- IOWait drops to ~0% (it’s usually at around 1-2%) but actual disk IO rates seem approx normal

- User and System CPU increase to a total of 100% - ~86% and ~14% respectively

- Processing time for RADIUS messages increases, and a big processing backlog builds

- Swap is not used - it is enabled, but very low swap IO

- Memory usage does not change

If I stop the ingest process briefly, then start it up again, the problem goes away - the database server drops to 0% CPU, then after starting the ingest process the backlog clears very rapidly and performance is back to normal.

This happens maybe once or twice a week - it’s not every day. It’s not on specific days each week.

There is a vague correlation with other aggregation jobs (i.e. event aggregation mentioned above) running immediately after the daily data usage aggregation. Only one of these jobs runs at once - so if another scheduled job wants to run, it will run immediately after whatever is already running.

I am wondering if there’s some sort of problem where we drop all these partitions, and postgres needs to do some work internally to free the space up or something, but has a hard time doing so with all the updates going on?

I am not clear why this only happens some days - I am working on seeing if I can firm up (or rule out) the correlation with other aggregation jobs running immediately afterwards.

Can anyone recommend some things to look at here? I’ve got quite a bit of metrics collected every minute - per-table io (i.e. hit/read), index sizes, table sizes, etc. - however everything there seems “normal” for the slow ingest rate when the issue occurs, so it’s hard to differentiate between cause and symptoms in those metrics.

I have bumped up effective_cache_size from default of 4GB to 16GB since this last happened - but given IO doesn’t appear to be an issue, I don’t think this will have too much effect.

Nathan Ward

Re: Occasional performance issue after changing table partitions

From

Justin Pryzby

Date:

10 July 2022, 14:05:20

On Sun, Jul 10, 2022 at 04:55:34PM +1200, Nathan Ward wrote:
> I am running Postgres 13 on CentOS 7, installed from the yum.postgresql.org <http://yum.postgresql.org/> repo.

It doesn't sound relevant, but what kind of storage systems is postgres using ?
Filesystem, raid, device.

Is the high CPU use related to to autovacuum/autoanalyze ?

> The issue I am having, is that when the daily data usage aggregation runs, sometimes we have a big performance
impact,with the following characteristics which happen *after* the aggregation job runs in it usual fast time of 12s or
so:
> - The aggregation runs fast as per normal
> - Load on the server goes to 30-40 - recall we have quite high “max connections” to keep throughput high when the
clientis far (16ms) from the server

I suggest to install and enable autoexplain to see what's running slowly here,
and what its query plans are.  It seems possible that when the daily
aggregation script drops the old partitions, the plan changes for the worse.
I'm not sure what the fix is - maybe you just need to run vacuum or analyze on
the new partitions soon after populating them.

For good measure, also set log_autovacuum_min_duration=0 (or something other
than -1) (and while you're at it, log_checkpoints=on, and log_lock_waits=on if
you haven't already).

Note that postgres doesn't automatically analyze parent tables, so you should
maybe do that whenever the data changes enough for it to matter.

-- 
Justin

Re: Occasional performance issue after changing table partitions

From

Nathan Ward

Date:

11 July 2022, 03:21:38

> On 11/07/2022, at 2:05 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Sun, Jul 10, 2022 at 04:55:34PM +1200, Nathan Ward wrote:
>> I am running Postgres 13 on CentOS 7, installed from the yum.postgresql.org <http://yum.postgresql.org/> repo.
>
> It doesn't sound relevant, but what kind of storage systems is postgres using ?
> Filesystem, raid, device.

It’s an NVME SSD backed SAN over 2x16G fibre channel. The postgres server is in a VM (vmware). I can pretty comfortably
do10Gbit/s or more to the disk (I’ve only personally tested that, because initial replication from other sites runs at
aroundthat sort of speed limited by ethernet interfaces). The normal IO is between 1MB/s and maybe 15MB/s - writes
only.Reads are pretty minimal. 
FS is XFS.

> Is the high CPU use related to to autovacuum/autoanalyze ?

Good question - I don’t know. I’ve set the server to debug1 log level so I can see that - I see you have some notes
belowabout autovacuum logs so I’ll see what that shows me. 
Since setting that log level I haven’t yet had the issue occur - I watched it tick over midnight last night and it was
normal(as it is, most days). 

>> The issue I am having, is that when the daily data usage aggregation runs, sometimes we have a big performance
impact,with the following characteristics which happen *after* the aggregation job runs in it usual fast time of 12s or
so:
>> - The aggregation runs fast as per normal
>> - Load on the server goes to 30-40 - recall we have quite high “max connections” to keep throughput high when the
clientis far (16ms) from the server 
>
> I suggest to install and enable autoexplain to see what's running slowly here,
> and what its query plans are.  It seems possible that when the daily
> aggregation script drops the old partitions, the plan changes for the worse.
> I'm not sure what the fix is - maybe you just need to run vacuum or analyze on
> the new partitions soon after populating them.

Hmm, I’ll check it out. I hadn’t thought that the query planner could be doing something different, that’s a good
point.

Note that the normal data ingest queries don’t hit the new partition - the new partition is for data 3 days ago, and
theingest queries only hit the partitions covering the last ~2 hours. 

> For good measure, also set log_autovacuum_min_duration=0 (or something other
> than -1) (and while you're at it, log_checkpoints=on, and log_lock_waits=on if
> you haven't already).

Wilco.

> Note that postgres doesn't automatically analyze parent tables, so you should
> maybe do that whenever the data changes enough for it to matter.

Hmm. This raises some stuff I’m not familiar with - does analysing a parent table do anything? I got the impression
thatanalysing the parent was just shorthand for analysing all of the attached partitions. 

Perhaps because I attach a table with data, the parent sometimes decides it needs to run analyse on a bunch of things?
Or, maybe it uses the most recently attached partition, with bad statistics, to plan queries that only touch other
partitions?

--
Nathan Ward

Re: Occasional performance issue after changing table partitions

From

Justin Pryzby

Date:

11 July 2022, 04:05:28

On Mon, Jul 11, 2022 at 03:21:38PM +1200, Nathan Ward wrote:
> > Note that postgres doesn't automatically analyze parent tables, so you should
> > maybe do that whenever the data changes enough for it to matter.
> 
> Hmm. This raises some stuff I’m not familiar with - does analysing a parent table do anything?

Yes

You could check if you have stats now (maybe due to a global ANALYZE or
analyzedb) and how the query plans change if you analyze.
The transaction may be overly conservative.

SELECT COUNT(1) FROM pg_stats WHERE tablename=PARENT;
SELECT last_analyze, last_autoanalyze, relname FROM pg_stat_all_tables WHERE relname=PARENT;
begin;
SET default_statistics_target=10;
ANALYZE;
explain SELECT [...];
rollback;

> I got the impression that analysing the parent was just shorthand for analysing all of the attached partitions.

Could you let us know if the documentation left that impression ?

See here (this was updated recently).

https://www.postgresql.org/docs/13/sql-analyze.html#id-1.9.3.46.8

For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions; in addition, it will recurse
intoeach partition and update its statistics. Each leaf partition is analyzed only once, even with multi-level
partitioning.No statistics are collected for only the parent table (without data from its partitions), because with
partitioningit's guaranteed to be empty.

By contrast, if the table being analyzed has inheritance children, ANALYZE gathers two sets of statistics: one on the
rowsof the parent table only, and a second including rows of both the parent table and all of its children. This second
setof statistics is needed when planning queries that process the inheritance tree as a whole. The child tables
themselvesare not individually analyzed in this case.

The autovacuum daemon does not process partitioned tables, nor does it process inheritance parents if only the children
areever modified. It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table
hierarchyup to date.

> Perhaps because I attach a table with data, the parent sometimes decides it needs to run analyse on a bunch of
things?

No, that doesn't happen.

> Or, maybe it uses the most recently attached partition, with bad statistics, to plan queries that only touch other
partitions?

This is closer to what I was talking about.

To be clear, you are using relkind=p partitions (added in v10), and not
inheritance parents, right ?

-- 
Justin

Re: Occasional performance issue after changing table partitions

From

Nathan Ward

Date:

11 July 2022, 06:20:23

> On 11/07/2022, at 4:05 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Mon, Jul 11, 2022 at 03:21:38PM +1200, Nathan Ward wrote:
>>> Note that postgres doesn't automatically analyze parent tables, so you should
>>> maybe do that whenever the data changes enough for it to matter.
>>
>> Hmm. This raises some stuff I’m not familiar with - does analysing a parent table do anything?
>
> Yes
>
> You could check if you have stats now (maybe due to a global ANALYZE or
> analyzedb) and how the query plans change if you analyze.
> The transaction may be overly conservative.
>
> SELECT COUNT(1) FROM pg_stats WHERE tablename=PARENT;
> SELECT last_analyze, last_autoanalyze, relname FROM pg_stat_all_tables WHERE relname=PARENT;
> begin;
> SET default_statistics_target=10;
> ANALYZE;
> explain SELECT [...];
> rollback;

I have a development database which gets a mirror of about 50% of the data coming in, and ran a global ANALYZE earlier
on- and note that the disk IO is actually a lot higher since which is interesting and not desirable obviously, so I
havesome more fiddling to do.. 
The behaviour during the ANALYZE was very similar to what happens on my production database when things go funny
though,so, this feels like it’s getting me close. 

The above is going to be a bit tricky to do I think - the ingest process runs a stored procedure, and behaviour varies
quitea bit if I stick in synthetic values. 

I think probably my approach for now will be to turn on auto explain with some sampling, and see what happens.

Side note, in the auto_explain docs, there is a note in a callout box saying that log_analyze has a high impact even if
thequery isn’t logged - if I use sampling, is this still the case - i.e. all queries are impacted - or is it only the
sampledqueries? 

>> I got the impression that analysing the parent was just shorthand for analysing all of the attached partitions.
>
> Could you let us know if the documentation left that impression ?
>
> See here (this was updated recently).
>
> https://www.postgresql.org/docs/13/sql-analyze.html#id-1.9.3.46.8
>
> For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions; in addition, it will recurse
intoeach partition and update its statistics. Each leaf partition is analyzed only once, even with multi-level
partitioning.No statistics are collected for only the parent table (without data from its partitions), because with
partitioningit's guaranteed to be empty. 
>
> By contrast, if the table being analyzed has inheritance children, ANALYZE gathers two sets of statistics: one on the
rowsof the parent table only, and a second including rows of both the parent table and all of its children. This second
setof statistics is needed when planning queries that process the inheritance tree as a whole. The child tables
themselvesare not individually analyzed in this case. 
>
> The autovacuum daemon does not process partitioned tables, nor does it process inheritance parents if only the
childrenare ever modified. It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the
tablehierarchy up to date. 

It was this part:
“””
No statistics are collected for *only* the parent table (without data from its partitions), because with partitioning
it'sguaranteed to be empty. 
“””

Emphasis around “only” is mine - I think my brain skipped that word, but, it’s obviously critical.

I also note this:
“””
It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table hierarchy up to date.
“””
This seems really important and is something I was entirely unaware of - maybe this should be in one of those callout
boxes.

>> Perhaps because I attach a table with data, the parent sometimes decides it needs to run analyse on a bunch of
things?
>
> No, that doesn't happen.

Ack.

>> Or, maybe it uses the most recently attached partition, with bad statistics, to plan queries that only touch other
partitions?
>
> This is closer to what I was talking about.
>
> To be clear, you are using relkind=p partitions (added in v10), and not
> inheritance parents, right ?

Yes, relkind=p.

--
Nathan Ward

Re: Occasional performance issue after changing table partitions

From

Nathan Ward

Date:

12 July 2022, 15:13:46

Hi,

I haven’t caught the issue yet with this debug etc. in place, but auto_explain (and some pg_stat_statements poking about) has helped me find something interesting that might(?) be related.

My data ingest is in 2 functions, depending on the type of data:

- RADIUS data with usage info

- RADIUS data without usage info

The functions are the largely same, except the one with usage info has to go and work with a table with lots of partitions (444, right at the moment).

The function that works with the usage info is *significantly* slower.

One thing I have specifically noticed is that the run time for the total function doesn’t add up to the run time of each of the of the nested statements. Not even close. It’s around 16ms on average (both in pg_stat_statements and in the auto_explain), but the nested statements add up to around 1-2ms or so - which I think means the planner is the culprit here.

I have been stepping through the various statements which are different between the two functions, and note that when I do math on a timestamp in a SELECT statement (i.e. _event_timestamp - INTERVAL ‘1 hour’), the planner takes 50ms or so - note that the result of the timestamp is used to search the partition key.

If I declare a function which does the math in advance, stores it in a variable and then runs the SELECT, the planner takes less than 1ms.

Does this mean it’s calculating the timestamp for each partition, or something like that?

I have updated the function in my production database to do this math in advance, and the mean time is down to around 6ms, from 16ms.

This is still longer than the actual statement execution times in the auto_explain output - which add up to around 1-2ms - but it’s better!

I’ve also turned on pg_stat_statements.track_planning and will see what that looks like after some time.

I see Postgres 14 release notes has information about performance improvements in the planner for updates on tables with "many partitions”. Is 444 partitions “many”?

My updates are all impacting a single partition only.

On 11/07/2022, at 6:20 PM, Nathan Ward <lists+postgresql@daork.net> wrote:

On 11/07/2022, at 4:05 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Jul 11, 2022 at 03:21:38PM +1200, Nathan Ward wrote:
Note that postgres doesn't automatically analyze parent tables, so you should
maybe do that whenever the data changes enough for it to matter.

Hmm. This raises some stuff I’m not familiar with - does analysing a parent table do anything?

Yes

You could check if you have stats now (maybe due to a global ANALYZE or
analyzedb) and how the query plans change if you analyze.
The transaction may be overly conservative.

SELECT COUNT(1) FROM pg_stats WHERE tablename=PARENT;
SELECT last_analyze, last_autoanalyze, relname FROM pg_stat_all_tables WHERE relname=PARENT;
begin;
SET default_statistics_target=10;
ANALYZE;
explain SELECT [...];
rollback;

I have a development database which gets a mirror of about 50% of the data coming in, and ran a global ANALYZE earlier on - and note that the disk IO is actually a lot higher since which is interesting and not desirable obviously, so I have some more fiddling to do..
The behaviour during the ANALYZE was very similar to what happens on my production database when things go funny though, so, this feels like it’s getting me close.

The above is going to be a bit tricky to do I think - the ingest process runs a stored procedure, and behaviour varies quite a bit if I stick in synthetic values.

I think probably my approach for now will be to turn on auto explain with some sampling, and see what happens.

Side note, in the auto_explain docs, there is a note in a callout box saying that log_analyze has a high impact even if the query isn’t logged - if I use sampling, is this still the case - i.e. all queries are impacted - or is it only the sampled queries?

I got the impression that analysing the parent was just shorthand for analysing all of the attached partitions.

Could you let us know if the documentation left that impression ?

See here (this was updated recently).

https://www.postgresql.org/docs/13/sql-analyze.html#id-1.9.3.46.8

For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions; in addition, it will recurse into each partition and update its statistics. Each leaf partition is analyzed only once, even with multi-level partitioning. No statistics are collected for only the parent table (without data from its partitions), because with partitioning it's guaranteed to be empty.

By contrast, if the table being analyzed has inheritance children, ANALYZE gathers two sets of statistics: one on the rows of the parent table only, and a second including rows of both the parent table and all of its children. This second set of statistics is needed when planning queries that process the inheritance tree as a whole. The child tables themselves are not individually analyzed in this case.

The autovacuum daemon does not process partitioned tables, nor does it process inheritance parents if only the children are ever modified. It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table hierarchy up to date.

It was this part:
“””
No statistics are collected for *only* the parent table (without data from its partitions), because with partitioning it's guaranteed to be empty.
“””

Emphasis around “only” is mine - I think my brain skipped that word, but, it’s obviously critical.

I also note this:
“””
It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table hierarchy up to date.
“””
This seems really important and is something I was entirely unaware of - maybe this should be in one of those callout boxes.

Perhaps because I attach a table with data, the parent sometimes decides it needs to run analyse on a bunch of things?

No, that doesn't happen.

Ack.

Or, maybe it uses the most recently attached partition, with bad statistics, to plan queries that only touch other partitions?

This is closer to what I was talking about.

To be clear, you are using relkind=p partitions (added in v10), and not
inheritance parents, right ?

Yes, relkind=p.

--
Nathan Ward

Re: Occasional performance issue after changing table partitions

From

Justin Pryzby

Date:

12 July 2022, 15:43:10

On Wed, Jul 13, 2022 at 03:13:46AM +1200, Nathan Ward wrote:
> I have been stepping through the various statements which are different between the two functions, and note that when
Ido math on a timestamp in a SELECT statement (i.e. _event_timestamp - INTERVAL ‘1 hour’),

> the planner takes 50ms or so - note that the result of the timestamp is used to search the partition key.
> If I declare a function which does the math in advance, stores it in a variable and then runs the SELECT, the planner
takesless than 1ms.

> Does this mean it’s calculating the timestamp for each partition, or something like that?

I'm not sure I understand what you're doing - the relevant parts of your
function text and query plan would help here.

Maybe auto_explain.log_nested_statements would be useful ?

Note that "partition pruning" can happen even if you don't have a literal
constant.  For example:
|explain(costs off) SELECT * FROM metrics WHERE start_time > now()::timestamp - '1 days'::interval;
| Append
|   Subplans Removed: 36

> I see Postgres 14 release notes has information about performance improvements in the planner for updates on tables
with"many partitions”. Is 444 partitions “many”?

> My updates are all impacting a single partition only.

It sounds like that'll certainly help you.  Another option is to update the
partition directly (which is what we do, to be able to use "ON CONFLICT").

I think with "old partitioning with inheritance", more than a few hundred
partitions was considered unreasonable, and plan-time suffered.

With relkind=p native/declarative partitioning, a few hundred is considered
reasonable, and a few thousand is still considered excessive - even if the
planner time is no issue, you'll still run into problems like "work-mem is
per-node", which works poorly when you might have 10x more nodes.

TBH, this doesn't sound related to your original issue.

-- 
Justin