Thread: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

[GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

I have a large database, 1.3 TB, with quite a bit of write activity. The machine has 2 CPUs x 6 cores x 2 threads (2 x E5-2630 v2 @ 2.60GHz) and 4 x EVO Pro 2TB SSDs in a software RAID 1+0 configuration on a SATA 3 controller.

 

The machine has a lot of memory, 384 GB, so it doesn’t do a lot of reads, as most of the frequently accessed data is cached.

 

Reads are between 5-10 MB/s on each SSD, giving a total of about 20-30 MB/s for the entire RAID volume.

 

Each SSD writes 30-40 MB/s on average, so each of the two mirrors writes at 60-80 MB/s, for a total of 120-160 MB/s of writes.

 

According to AnandTech benchmarks, those SSDs should be able to do around 90 MB/s random reads and 250 MB/s random writes at 4KB block size, so in theory I shouldn’t hit I/O limits even if all I/O was random and very small.

 

I estimate the SSDs should be about 10% busy from the reads and 15% busy from the writes, so maybe 25% total.

 

However, sar -d shows %util on those SSDs between 70% and 80% under the normal daily load. With SSDs you can’t fully trust the %util reported by sar, because SSDs perform best at large queue depths. The average queue size reported by sar is between 5 and 10, so I’m not entirely sure whether to trust the %util values. Then again, a queue size of 8 or larger should be enough to keep most SSDs busy, but in that case I would expect %util to be 100% all the time.

 

Still, based on what sar is showing, I am concerned about hitting an I/O performance wall as the activity on that database might increase in the future, and I was trying to determine the actual limits, especially for writes. While doing that, I started to investigate the checkpoint behavior.

 

The database is configured with:

 

         - the default checkpoint timeout interval of 5 minutes (not set explicitly). It’s probably not long enough, I plan to increase it to something like 30 minutes, or more.

         - checkpoint_warning with the default at 30 seconds. Not sure if that is measured between the end of the previous checkpoint and the start of the next one, or between the start times (or end times) of two consecutive checkpoints.

         - completion target of 0.85, so it should write for 4:15 minutes every 5 minutes.

         - max_wal_size = 16 GB, I plan to increase it to something like 96 GB.

         - 96 GB of shared buffers. I may increase them to something like 144 GB

         - bgwriter with 400 max pages and the default 200ms delay. That should result in 400 x 5 x 8kb pages / second max, so 16 MB/s max from the bgwriter. It’s probably too much and I should have left it to the default value of 100 pages, or even lower. In fact I’m not even sure why I even need the bgwriter process instead of using just checkpoints for writes. In any case, even with 400 pages, it should still leave some work for the checkpoint process.

         - wal_buffers of 32 MB. I had them as high as 256 MB at some point, but it seems values larger than 16 MB are useless, so I will reduce it to 16 MB.
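
For reference, the effective values of these settings can be double-checked from the server itself; a minimal query, nothing specific to my setup, just the standard pg_settings view:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'checkpoint_warning',
               'checkpoint_completion_target', 'max_wal_size',
               'shared_buffers', 'bgwriter_lru_maxpages',
               'bgwriter_delay', 'wal_buffers');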

 

The log of the last checkpoints looks like this when the machine is idle in the morning:

 

2017-10-04 08:43:00.283 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 08:47:15.887 EDT [7831] LOG:  checkpoint complete: wrote 19290 buffers (0.2%); 0 transaction log file(s) added, 4 removed, 0 recycled; write=254.787 s, sync=0.794 s, total=255.604 s; sync files=274, longest=0.033 s, average=0.002 s; distance=61864 kB, estimate=322398 kB

 

So it does take the 5-minute timeout and the 0.85 completion target into account correctly: 300 x 0.85 = 255 seconds, and it did indeed spread the writes over 255 seconds.
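
The same expected write window can be computed from the settings themselves; a small sanity-check query (checkpoint_timeout is reported in seconds in pg_settings, so no unit conversion is needed):

SELECT (SELECT setting::int   FROM pg_settings WHERE name = 'checkpoint_timeout')
     * (SELECT setting::float FROM pg_settings WHERE name = 'checkpoint_completion_target')
       AS expected_write_seconds;   -- 300 * 0.85 = 255 with the settings above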

 

However, later during the day, as the activity increases it starts to look like this:

 

2017-10-04 08:48:00.932 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 08:55:27.179 EDT [7831] LOG:  checkpoint complete: wrote 635481 buffers (5.1%); 0 transaction log file(s) added, 4 removed, 0 recycled; write=442.133 s, sync=4.003 s, total=446.246 s; sync files=445, longest=0.118 s, average=0.008 s; distance=5227434 kB, estimate=5227434 kB

 

First, a checkpoint finished in 446 seconds instead of 255 seconds. 635481 buffers in 446 seconds is about 1425 buffers/second, around 12 MB/second. I would say that’s not such a huge amount of I/O for the hardware. Then the next one starts immediately:

 

2017-10-04 08:55:27.180 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 09:05:23.391 EDT [7831] LOG:  checkpoint complete: wrote 2090424 buffers (16.6%); 0 transaction log file(s) added, 319 removed, 0 recycled; write=595.690 s, sync=0.468 s, total=596.211 s; sync files=454, longest=0.020 s, average=0.001 s; distance=11415997 kB, estimate=11415997 kB

 

This one takes even longer, 596 seconds: 2090424 buffers, about 3500 buffers/second, around 29 MB/second. That should still not be such a big deal; it would mean writing at around 15 MB/s on each SSD just for checkpoints.
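
For anyone checking the math, the rate is just the number of buffers times the 8 kB block size, divided by the write time; a throwaway calculation for this checkpoint:

SELECT 2090424 * 8192 / 596.0 / 1e6 AS approx_mb_per_second;   -- ≈ 29 MB/s of checkpoint writes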

 

So, not sure what is going on here. I could understand that the checkpoint process is surprised by the sudden increase in writes at the beginning of the day, and the first 2-3 checkpoints would take more than the configured duration. But even the next checkpoints follow the same pattern of taking more than 255 seconds, and sometimes more than 300 seconds:

 

2017-10-04 09:05:23.392 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 09:12:54.799 EDT [7831] LOG:  checkpoint complete: wrote 1342046 buffers (10.7%); 0 transaction log file(s) added, 697 removed, 0 recycled; write=450.554 s, sync=0.791 s, total=451.406 s; sync files=706, longest=0.014 s, average=0.001 s; distance=8478044 kB, estimate=11122202 kB

2017-10-04 09:12:54.801 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 09:17:49.897 EDT [7831] LOG:  checkpoint complete: wrote 1220559 buffers (9.7%); 0 transaction log file(s) added, 376 removed, 141 recycled; write=292.734 s, sync=0.677 s, total=295.096 s; sync files=690, longest=0.026 s, average=0.000 s; distance=8551073 kB, estimate=10865089 kB

2017-10-04 09:17:54.903 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 09:22:19.092 EDT [7831] LOG:  checkpoint complete: wrote 864467 buffers (6.9%); 0 transaction log file(s) added, 213 removed, 309 recycled; write=258.673 s, sync=1.379 s, total=264.188 s; sync files=672, longest=0.034 s, average=0.002 s; distance=5986866 kB, estimate=10377266 kB

2017-10-04 09:22:54.127 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 09:27:22.678 EDT [7831] LOG:  checkpoint complete: wrote 962442 buffers (7.6%); 0 transaction log file(s) added, 176 removed, 190 recycled; write=265.684 s, sync=0.603 s, total=268.550 s; sync files=689, longest=0.018 s, average=0.000 s; distance=6278971 kB, estimate=9967437 kB

2017-10-04 09:27:22.679 EDT [7831] LOG:  checkpoint starting: xlog

2017-10-04 09:32:08.082 EDT [7831] LOG:  checkpoint complete: wrote 1206510 buffers (9.6%); 0 transaction log file(s) added, 200 removed, 183 recycled; write=282.610 s, sync=0.622 s, total=285.402 s; sync files=683, longest=0.013 s, average=0.000 s; distance=7412886 kB, estimate=9711982 kB

2017-10-04 09:32:22.097 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 09:36:51.049 EDT [7831] LOG:  checkpoint complete: wrote 977391 buffers (7.8%); 0 transaction log file(s) added, 218 removed, 234 recycled; write=264.946 s, sync=0.939 s, total=268.952 s; sync files=671, longest=0.032 s, average=0.001 s; distance=6576530 kB, estimate=9398437 kB

2017-10-04 09:37:09.023 EDT [7831] LOG:  checkpoint starting: xlog

2017-10-04 09:41:41.421 EDT [7831] LOG:  checkpoint complete: wrote 1051219 buffers (8.4%); 0 transaction log file(s) added, 245 removed, 157 recycled; write=269.147 s, sync=0.996 s, total=272.398 s; sync files=699, longest=0.026 s, average=0.001 s; distance=6699634 kB, estimate=9128556 kB

2017-10-04 09:41:41.423 EDT [7831] LOG:  checkpoint starting: xlog

2017-10-04 09:46:36.460 EDT [7831] LOG:  checkpoint complete: wrote 1238858 buffers (9.8%); 0 transaction log file(s) added, 260 removed, 149 recycled; write=292.544 s, sync=0.833 s, total=295.037 s; sync files=716, longest=0.033 s, average=0.001 s; distance=7551236 kB, estimate=8970824 kB

2017-10-04 09:46:36.462 EDT [7831] LOG:  checkpoint starting: xlog

2017-10-04 09:51:03.839 EDT [7831] LOG:  checkpoint complete: wrote 1038125 buffers (8.3%); 0 transaction log file(s) added, 177 removed, 283 recycled; write=263.307 s, sync=0.718 s, total=267.377 s; sync files=708, longest=0.027 s, average=0.001 s; distance=6799399 kB, estimate=8753682 kB

2017-10-04 09:51:36.871 EDT [7831] LOG:  checkpoint starting: time

2017-10-04 09:56:30.606 EDT [7831] LOG:  checkpoint complete: wrote 913761 buffers (7.3%); 0 transaction log file(s) added, 152 removed, 263 recycled; write=288.352 s, sync=0.933 s, total=293.735 s; sync files=697, longest=0.027 s, average=0.001 s; distance=5756934 kB, estimate=8454007 kB

2017-10-04 09:56:30.607 EDT [7831] LOG:  checkpoint starting: xlog

2017-10-04 10:02:44.340 EDT [7831] LOG:  checkpoint complete: wrote 1080696 buffers (8.6%); 0 transaction log file(s) added, 189 removed, 163 recycled; write=371.377 s, sync=0.569 s, total=373.732 s; sync files=712, longest=0.014 s, average=0.000 s; distance=6734628 kB, estimate=8282069 kB

2017-10-04 10:02:44.341 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:10:13.903 EDT [7831] LOG:  checkpoint complete: wrote 1200040 buffers (9.5%); 0 transaction log file(s) added, 360 removed, 51 recycled; write=448.221 s, sync=0.760 s, total=449.561 s; sync files=696, longest=0.021 s, average=0.001 s; distance=7387465 kB, estimate=8192609 kB

2017-10-04 10:10:13.905 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:16:33.351 EDT [7831] LOG:  checkpoint complete: wrote 1472230 buffers (11.7%); 0 transaction log file(s) added, 401 removed, 50 recycled; write=377.834 s, sync=0.979 s, total=379.445 s; sync files=700, longest=0.028 s, average=0.001 s; distance=8555349 kB, estimate=8555349 kB

2017-10-04 10:16:33.352 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:21:58.810 EDT [7831] LOG:  checkpoint complete: wrote 1250086 buffers (9.9%); 0 transaction log file(s) added, 378 removed, 144 recycled; write=322.737 s, sync=0.888 s, total=325.457 s; sync files=680, longest=0.031 s, average=0.001 s; distance=7402273 kB, estimate=8440041 kB

2017-10-04 10:21:58.811 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:27:09.544 EDT [7831] LOG:  checkpoint complete: wrote 1171682 buffers (9.3%); 0 transaction log file(s) added, 264 removed, 188 recycled; write=307.983 s, sync=0.796 s, total=310.732 s; sync files=688, longest=0.026 s, average=0.001 s; distance=7018008 kB, estimate=8297838 kB

2017-10-04 10:27:09.546 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:34:16.507 EDT [7831] LOG:  checkpoint complete: wrote 1088387 buffers (8.6%); 0 transaction log file(s) added, 343 removed, 85 recycled; write=425.445 s, sync=0.480 s, total=426.961 s; sync files=670, longest=0.012 s, average=0.000 s; distance=6695430 kB, estimate=8137597 kB

 

Then, as the activity on the database increases even more, checkpoints start to take more than 10 minutes:

 

2017-10-04 10:34:16.509 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 10:45:29.233 EDT [7831] LOG:  checkpoint complete: wrote 1445250 buffers (11.5%); 1 transaction log file(s) added, 409 removed, 0 recycled; write=671.755 s, sync=0.832 s, total=672.724 s; sync files=687, longest=0.039 s, average=0.001 s; distance=8688405 kB, estimate=8688405 kB

2017-10-04 10:45:29.235 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 11:04:48.081 EDT [7831] LOG:  checkpoint complete: wrote 1929827 buffers (15.3%); 1 transaction log file(s) added, 530 removed, 0 recycled; write=1156.147 s, sync=2.487 s, total=1158.846 s; sync files=704, longest=0.046 s, average=0.003 s; distance=12851319 kB, estimate=12851319 kB

 

Then more than 20 minutes:

 

2017-10-04 11:04:48.082 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 11:25:55.806 EDT [7831] LOG:  checkpoint complete: wrote 2695524 buffers (21.4%); 0 transaction log file(s) added, 784 removed, 0 recycled; write=1257.586 s, sync=10.070 s, total=1267.723 s; sync files=709, longest=0.117 s, average=0.014 s; distance=20710817 kB, estimate=20710817 kB

2017-10-04 11:25:55.807 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 11:48:08.366 EDT [7831] LOG:  checkpoint complete: wrote 2857107 buffers (22.7%); 0 transaction log file(s) added, 1264 removed, 0 recycled; write=1326.199 s, sync=6.233 s, total=1332.558 s; sync files=726, longest=0.076 s, average=0.008 s; distance=24439493 kB, estimate=24439493 kB

2017-10-04 11:48:08.368 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 12:11:39.550 EDT [7831] LOG:  checkpoint complete: wrote 3127276 buffers (24.9%); 0 transaction log file(s) added, 1492 removed, 0 recycled; write=1403.735 s, sync=7.293 s, total=1411.182 s; sync files=732, longest=0.071 s, average=0.009 s; distance=27967471 kB, estimate=27967471 kB

2017-10-04 12:11:39.552 EDT [7831] LOG:  checkpoint starting: xlog time

2017-10-04 12:32:28.853 EDT [7831] LOG:  checkpoint complete: wrote 2823207 buffers (22.4%); 0 transaction log file(s) added, 1707 removed, 0 recycled; write=1236.280 s, sync=12.828 s, total=1249.300 s; sync files=739, longest=0.133 s, average=0.017 s; distance=28321252 kB, estimate=28321252 kB

 

There are no “checkpoints are occurring too frequently” messages in the logs. Since each checkpoint starts as soon as the previous one ends, far sooner than the 30 seconds I configured for the warning, in my opinion there should have been warnings, but there aren’t. The warning probably measures start to start, or end to end, not end to start. The lack of a warning made me even more confused, and I lost a lot of time trying to figure this out until I noticed the long durations.

 

In any case, I still don’t fully understand what is happening here. It looks like I’m hitting some hardware limits, but it’s not obvious. The CPU usage is around 20% user, 4% system, and Linux also reports 10-15% iowait, so the machine looks a bit stressed on the I/O side and a bit less on the CPU side, but again, I can’t be sure.

 

The longest checkpoint today, at 11:48 AM, took 1411 seconds and wrote 3127276 buffers, so 2216 buffers/second. That’s just 18 MB/second, or 9 MB/second on each SSD from checkpoints. In fact, earlier in the day a shorter checkpoint did 29 MB/second, so I think this larger one could have completed faster.

 

The sar disk I/O report looked like this during this 1411-second checkpoint:

 

11:55:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util

12:05:01 PM   dev8-16   2478.72  17412.49  64107.22     32.89      7.02      2.83      0.31     77.79

12:05:01 PM    dev8-0   2139.80  10546.02  64153.14     34.91      7.16      3.34      0.35     75.27

12:05:01 PM   dev8-32   2155.46  10881.22  64075.18     34.78      6.84      3.17      0.35     75.08

12:05:01 PM   dev8-48   2497.76  17701.13  64153.14     32.77      7.90      3.16      0.31     77.52

12:05:01 PM    dev9-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:05:01 PM    dev9-1   5873.69  60126.95 126219.54     31.73      0.00      0.00      0.00      0.00

 

9-1 is the RAID, and 8-0, 8-16, 8-32, 8-48 the SSDs. So around 32 MB/s writes and 7 MB/s reads on each SSD. In theory the SSDs should be easily able to handle that level of I/O, so I’m suspecting something else might be slowing the checkpoints.
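
The MB/s figures come from sar’s rd_sec/s and wr_sec/s columns, which count 512-byte sectors; for example, for the first SSD in the report above:

SELECT 64107.22 * 512 / 1e6 AS write_mb_per_second,   -- ≈ 32.8 MB/s written on that SSD
       17412.49 * 512 / 1e6 AS read_mb_per_second;    -- ≈ 8.9 MB/s read on that SSD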

 

Could there be some inefficiency or bug in the checkpoint process making it not work as fast as it should? Or something I may have configured improperly triggering this behavior?

 

Thanks,

Vlad

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

Some further updates about the issue.

 

I did a bit of benchmarking on the disk system with iozone, and during the test the SSDs seemed able to easily sustain 200 MB/s of writes each; they fluctuated between 200 MB/s and 400 MB/s when doing 96 GB of random writes in a file. That would mean between 400 and 800 MB/s for the entire RAID volume, since it’s 1+0 with 4 SSDs, 2 in each mirror.

 

I wasn’t able to benchmark reads properly because the machine has so much RAM that reads are mostly served from cache. But reads shouldn’t be an issue anyway. I’ll try to do more comprehensive tests tomorrow, but from what I’ve seen so far, SSD I/O limits shouldn’t be a concern.

 

I changed some configuration parameters during the night to the values I was considering yesterday:

 

  • shared_buffers = 144GB #previously 96 GB
  • bgwriter_lru_maxpages = 100  #previously 400
  • checkpoint_timeout = 30min  #previously 5min
  • checkpoint_completion_target = 0.83 #previously 0.85; 0.83 means 25 minutes writes out of 30 minutes.
  • max_wal_size = 96GB #previously 16GB
  • wal_buffers = 16MB  #previously 32 MB
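
For the record, this is roughly how the change looks when applied with ALTER SYSTEM instead of editing postgresql.conf by hand (a sketch, not necessarily how I applied it; shared_buffers and wal_buffers only take effect after a full restart):

ALTER SYSTEM SET shared_buffers = '144GB';               -- requires a server restart
ALTER SYSTEM SET wal_buffers = '16MB';                   -- requires a server restart
ALTER SYSTEM SET bgwriter_lru_maxpages = 100;
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.83;
ALTER SYSTEM SET max_wal_size = '96GB';
SELECT pg_reload_conf();                                 -- picks up the reloadable settings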

 

With the new settings the checkpoints now finish on time, more or less. One recent checkpoint looked like this:

 

2017-10-05 14:16:22.891 EDT [7828] LOG:  checkpoint starting: time

2017-10-05 14:42:35.429 EDT [7828] LOG:  checkpoint complete: wrote 4770679 buffers (25.3%); 0 transaction log file(s) added, 0 removed, 2088 recycled; write=1501.567 s, sync=1.844 s, total=1572.538 s; sync files=750, longest=0.029 s, average=0.002 s; distance=33579763 kB, estimate=35950738 kB

 

So the write phase took 1502 seconds instead of the configured 1494, which is close enough, just a few seconds more. Sync was another 2 seconds, so write + sync is about 1504 seconds. However, the total is 1573 seconds, 69 seconds more. I’m not sure what the checkpoint process does in those 69 seconds.
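
The cumulative counters in the standard pg_stat_bgwriter view are another way to cross-check these log numbers (column names as in 9.6; the timing columns are in milliseconds):

SELECT checkpoints_timed,                               -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,                                 -- checkpoints triggered by xlog / explicit requests
       buffers_checkpoint,                              -- buffers written by checkpoints so far
       checkpoint_write_time / 1000.0 AS write_seconds,
       checkpoint_sync_time  / 1000.0 AS sync_seconds
FROM pg_stat_bgwriter;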

 

Looking further back at today’s checkpoints in the logs, the first checkpoint of the day that did some significant work didn’t have such a large difference between write time and total time: it had write + sync = 1494 and total 1495, just 1 second difference:

 

2017-10-05 09:16:22.851 EDT [7828] LOG:  checkpoint starting: time

2017-10-05 09:41:18.139 EDT [7828] LOG:  checkpoint complete: wrote 4445573 buffers (23.6%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=1493.322 s, sync=0.575 s, total=1495.287 s; sync files=734, longest=0.022 s, average=0.000 s; distance=29166105 kB, estimate=29166105 kB

 

The difference is that it didn’t need to recycle as many WAL segments as the later one. I’m not sure what exactly the recycling does, but it seems to add some overhead to the process. The checkpoint with the 69-second delay recycled 2088 WAL segments, the one with the 1-second delay recycled just 98.

 

If the overhead is indeed caused by recycling, I’m not sure whether it’s proportional to the size of the recycled segments or to their number. If it’s the number of segments, then maybe a WAL segment size larger than 16 MB would help with this issue, but unfortunately, for the time being, that can only be configured at compile time, and it forces you to recreate the cluster as well.

 

For tomorrow I plan to reduce bgwriter_lru_maxpages from 100 to 0 and rely just on checkpoints for datafile writes, and to reduce checkpoint_completion_target from 0.83 to 0.66 to try to make the checkpoints finish the writes in 20 minutes instead of 25, out of 30.

 

Thanks,

Vlad

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Igor Polishchuk
Date:
Vladimir,
Just curious, if your goal is to reduce checkpoint overhead, shouldn’t you decrease shared_buffers instead of increasing it?
With bigger shared_buffers, you can accumulate more dirty buffers for the checkpoint to take care of.
I remember that in early versions (around 8.4), when checkpoint_completion_target was not available, one suggested way of fighting heavy checkpoints was
setting very low shared_buffers.
Also, why do you need to reduce your checkpoint write time to 20 minutes from 25? What will you gain? If you have the same number of dirty buffers to flush, your I/O intensity and overhead will increase in those 20 minutes.

Igor

  

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

My primary goal is to determine the current level of utilization of the current hardware + software combination, and how much more load it can handle.

 

It’s quite likely that the volume of work performed by that database will need to increase by 50% or more over the next months, and I don’t want to risk hitting a performance wall unexpectedly when that happens. I want to plan in advance for more hardware or other solutions in case the current hardware and software combination is inadequate.

 

My secondary goal is to determine optimal settings for best performance, and learn more about how postgres works.

 

From the point of view of the applications using the database, the system doesn’t currently experience any performance issues. Even when the checkpoints didn’t finish in the default configured 5 minutes, the database performed acceptably.

About the shared_buffers, decreasing them would indeed make the database keep fewer dirty buffers, but that would not reduce the overall number of writes to the disk. It would just shift more of the responsibility for writes from the checkpoint process to the bgwriter process, and I don’t see any advantage in doing that.

 

Actually, reducing the shared buffers may even increase the number of writes to the datafiles, because according to the documentation the bgwriter process can write the same buffer multiple times during a checkpoint interval, while the checkpoint would only write it once. That’s why I want to completely disable the bgwriter.
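
For what it’s worth, the split of writes between the checkpointer, the bgwriter and the backends is visible in pg_stat_bgwriter, so it should be easy to see how much the bgwriter actually contributes before and after disabling it:

SELECT buffers_checkpoint,   -- buffers written by the checkpointer
       buffers_clean,        -- buffers written by the background writer
       buffers_backend,      -- buffers written directly by backends
       maxwritten_clean      -- times the bgwriter stopped because it hit bgwriter_lru_maxpages
FROM pg_stat_bgwriter;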

 

About reducing the checkpoint write time from 25 minutes to 20 minutes, you are correct that the same number of buffers will need to be flushed and that it will increase the I/O intensity and put more stress on the disk system in that interval. That’s exactly what I want.

 

I want to see if the hardware is powerful enough to complete the checkpoint in just 20 minutes. This will prove there is some headroom and the database can handle more load. If that works fine, I may even try 15 minutes, to see how it behaves.

 

Since it’s a production system I have to be a bit careful; downtime during the day can be quite expensive. But suddenly hitting a performance wall in the future would be even more expensive.

 

After I finish the tests and reach a conclusion about the capabilities of this hardware + software combination, I will probably set it to something like 0.90 target, so that it distributes the writes over 27 minutes.

 

Thanks,

Vlad

 


Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Andres Freund
Date:
Hi,

On 2017-10-05 22:58:31 +0300, Vladimir Nicolici wrote:
> I changed some configuration parameters during the night to the values I was considering yesterday:
> 
> - shared_buffers = 144GB #previously 96 GB
> - bgwriter_lru_maxpages = 100  #previously 400
> - checkpoint_timeout = 30min  #previously 5min
> - checkpoint_completion_target = 0.83 #previously 0.85; 0.83 means 25 minutes writes out of 30 minutes.
> - max_wal_size = 96GB #previously 16GB

Which version of postgres is this?

> - wal_buffers = 16MB  #previously 32 MB

That seems quite unlikely to help.

You might want to try also enabling wal_compression, sometimes the WAL
volume is a considerable problem.

I'd suggest reporting some "pidstat -dl 1" output, so we can see which
processes are doing how much IO.

Regards,

Andres




Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

Hello, it’s Postgres 9.6. I know reducing wal_buffers from 32 MB to 16 MB is unlikely to help, but according to the documentation values larger than 16 MB are unlikely to help either, at least at the default WAL segment size, so I decided to go with 16 MB.

 

I will probably try the compression on Monday or Tuesday, I can only experiment with a single set of changes in a day, and I plan to test something else tomorrow.

 

Thanks for the suggestions, and sorry for the reply style, but my mail client is not best suited for replying inline to individual points.

 


 

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Andres Freund
Date:
Hi,

On 2017-10-06 05:53:39 +0300, Vladimir Nicolici wrote:
> Hello, it’s postgres 9.6.

Consider setting checkpoint_flush_after to 16MB or something large like that.


> I will probably try the compression on Monday or Tuesday, I can only
> experiment with a single set of changes in a day, and I plan to test
> something else tomorrow.
> 
> Thanks for the suggestions, and sorry for the reply style, but my mail
> client is not best suited for replying inline to individual points.

You should consider getting a new mail client then...

- Andres



Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

Further updates:

 

Yesterday checkpoints were finishing more or less on time with the configuration for 25 minutes out of 30 minutes, taking 26 minutes at most.

 

So for today I reduced the time reserved for checkpoint writes to 20 minutes out of 30 minutes, by setting checkpoint_completion_target to 0.66 instead of 0.83, and I disabled the bgwriter completely by setting bgwriter_lru_maxpages from 100 to 0.

 

I expected this might cause the checkpoints to take more time, and not finish in 20 minutes, but that didn’t happen.

 

For the first part of the day, when the activity was lower, checkpoints finished almost on time, less than 30 seconds more than configured:

 

2017-10-06 10:30:56.248 EDT [4927] LOG:  checkpoint starting: time

2017-10-06 10:51:26.227 EDT [4927] LOG:  checkpoint complete: wrote 4364979 buffers (23.1%); 0 transaction log file(s) added, 0 removed, 1936 recycled; write=1187.940 s, sync=1.888 s, total=1229.979 s; sync files=744, longest=0.041 s, average=0.002 s; distance=31073898 kB, estimate=32879405 kB

 

Then later, as the activity increased, something I didn’t expect happened. The checkpoints started to finish much sooner than configured, in 16:38 minutes instead of 20 minutes, so 3:22 minutes earlier than configured:

 

2017-10-06 11:00:56.347 EDT [4927] LOG:  checkpoint starting: time

2017-10-06 11:17:34.778 EDT [4927] LOG:  checkpoint complete: wrote 5376422 buffers (28.5%); 0 transaction log file(s) added, 0 removed, 1897 recycled; write=959.908 s, sync=1.578 s, total=998.431 s; sync files=772, longest=0.045 s, average=0.002 s; distance=37712624 kB, estimate=37712624 kB

 

Then, even stranger, the next checkpoint started sooner than configured, by 2:11 minutes, with the reason “xlog” instead of “time”:

 

2017-10-06 11:27:48.754 EDT [4927] LOG:  checkpoint starting: xlog

2017-10-06 11:42:00.131 EDT [4927] LOG:  checkpoint complete: wrote 6123787 buffers (32.4%); 0 transaction log file(s) added, 239 removed, 2063 recycled; write=806.887 s, sync=1.032 s, total=851.377 s; sync files=767, longest=0.052 s, average=0.001 s; distance=40254253 kB, estimate=40254253 kB

 

All subsequent checkpoints for the rest of the day look similar, they start with the reason “xlog”, and finish in around 15 minutes each instead of 20, and the time between two checkpoint starts is around 25 minutes instead of 30.

 

I’m not sure what causes these early finishes and early starts. The checkpoint start message is not very helpful in determining why it started early, in my opinion. It just says “xlog”; a more verbose description, maybe even including the numbers the decision to start the checkpoint was based on, would be more helpful.

 

I’m guessing the reason for starting a checkpoint is that it estimated the 96 GB reserved with max_wal_size would be filled unless it starts a checkpoint earlier than configured. I’ll increase the max_wal_size from 96 GB to 144 GB for Monday, to see if it changes this behavior.
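
If I understand the sizing logic correctly, the server aims to start an “xlog” checkpoint once the WAL written since the previous checkpoint reaches roughly max_wal_size / (2 + checkpoint_completion_target), keeping room for the WAL that accumulates while the checkpoint itself runs. A back-of-the-envelope check against my settings (treat the formula as my reading of the behavior, not gospel):

SELECT 96.0  / (2 + 0.66) AS approx_gb_between_xlog_checkpoints,   -- ≈ 36 GB with max_wal_size = 96GB
       144.0 / (2 + 0.66) AS approx_gb_if_raised_to_144gb;         -- ≈ 54 GB with max_wal_size = 144GB

The distance values in the log above are already in the 36-40 GB range per checkpoint, which would explain why the “xlog” checkpoints start early.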

 

Thanks,

Vlad

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

I experimented some more with the settings this weekend, while doing some large write operations (deleting 200 million records from a table), and I realized that the database is capable of generating much more WAL than I estimated.

 

And it seems that spikes in write activity, when they last longer than a few minutes, can cause the checkpoint process to “panic” and start a checkpoint earlier, trying to complete it as soon as possible, estimating, correctly, that if that level of activity continues it will hit the max_wal_size limit.

 

Based on that, I reduced checkpoint_timeout from 30 minutes to 20 minutes, while keeping max_wal_size at 144 GB. Alternatively, I could have increased the maximum WAL size more, but I’m not sure it’s a good idea to set it higher than the shared buffers, which are also set at 144 GB. After this change, on Monday all checkpoints were triggered by “time”; I didn’t have any more checkpoints triggered by “xlog”.

 

I also set checkpoint_completion_target to 0.5 to see if our hardware can handle concentrating the write activity for 20 minutes in just 10 minutes, and that worked very well too, checkpoints finished on time. The %util (busy%) for the SSDs as reported by sar was around 20% when not doing a checkpoint, and 60% during the checkpoint, so it seems the hardware will be able to handle future increases in activity just fine.

 

The lesson I learned here is that max_wal_size needs to be configured based on the *maximum* volume of WAL the database can generate in the checkpoint_timeout interval. Initially I had set it based on the *average* volume of WAL generated in that interval, at 3 times that average, but that was not enough, which triggered the unexpected behavior.
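
To size it based on the actual peak rather than the average, the WAL volume generated over an interval can be measured directly; a rough sketch using the 9.6 function names, sampling one minute (adjust the interval as needed):

DO $$
DECLARE
  start_lsn pg_lsn := pg_current_xlog_location();   -- WAL position at the start of the sample
BEGIN
  PERFORM pg_sleep(60);                             -- sample for one minute
  RAISE NOTICE 'WAL generated: % MB',
    round(pg_xlog_location_diff(pg_current_xlog_location(), start_lsn) / 1024 / 1024, 1);
END $$;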

 

Thanks,

Vlad

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Achilleas Mantzios
Date:
Hello Vladimir,

maybe your update triggered autovacuum on those tables? The default autovacuum_freeze_max_age is set at exactly 200,000,000. Did you check your vacuum stats afterwards (pg_stat_*_tables)?
Can you show the code which performed the deletes?



-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

No, it didn’t. The delete was done in a single transaction.

 


 

Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

From
Vladimir Nicolici
Date:

In fact it was a single delete statement.

 
