Re: [GENERAL] Strange checkpoint behavior - checkpoints take a longtime - Mailing list pgsql-general

From Achilleas Mantzios
Subject Re: [GENERAL] Strange checkpoint behavior - checkpoints take a longtime
Date
Msg-id d679178f-1c09-e136-f0a2-cedd98f65515@matrix.gatewaynet.com
Whole thread Raw
In response to Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time  (Vladimir Nicolici <vladnc@gmail.com>)
Responses Re: [GENERAL] Strange checkpoint behavior - checkpoints take alongtime  (Vladimir Nicolici <vladnc@gmail.com>)
List pgsql-general
Hello Vladimir,

maybe your update triggered auto_vacuum on those tables ? Default autovacuum_freeze_max_age is exactly set at 200,000,000 . Did you check your vacuum stats afterwards (pg_stat_*_tables) ?
Can you show the code which performed the deletes?

On 10/10/2017 16:56, Vladimir Nicolici wrote:

I experimented some more with the settings this weekend, while doing some large write operations (deleting 200 million records from a table), and I realized that the database is capable of generating much more WAL than I estimated.

 

And it seems that spikes in write activity, when longer than a few minutes, can cause the checkpoint process to “panic” and start a checkpoint earlier, and trying to complete it as soon as possible, estimating, correctly, that if that level of activity continues it will hit the max_wal_size limit.

 

Based on that, I reduced the checkpoint_timeout from 30 minutes to 20 minutes, while keeping max_wal_size at 144GB . Alternatively I could have increased the maximum WAL size more, but I’m not sure it’s a good idea to set it higher than the shared buffers, which are also set at 144GB. After this change, on Monday all checkpoints were triggered by “time”, I didn’t have any more checkpoints triggered by “xlog”.

 

I also set checkpoint_completion_target to 0.5 to see if our hardware can handle concentrating the write activity for 20 minutes in just 10 minutes, and that worked very well too, checkpoints finished on time. The %util (busy%) for the SSDs as reported by sar was around 20% when not doing a checkpoint, and 60% during the checkpoint, so it seems the hardware will be able to handle future increases in activity just fine.

 

The lesson I learned here is that max_wal_size needs to be configured based on the *maximum* volume of wal the database can generate in the checkpoint_timeout interval. Initially I had it set based on the *average* volume of wal generated in that interval, setting it to 3 times that average, but that was not enough, triggering the unexpected behavior.

 

Thanks,

Vlad


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

pgsql-general by date:

Previous
From: Vladimir Nicolici
Date:
Subject: Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time
Next
From: Vladimir Nicolici
Date:
Subject: Re: [GENERAL] Strange checkpoint behavior - checkpoints take alongtime