Thread: (Relatively) Oversized Checkpoint
We're still learning to tune PostgreSQL for our production setup, and over the weekend we saw a huge spike in INSERT/UPDATE completion time for a short period, which turned out to correlate with an oversized checkpoint. I've included the preceding and following checkpoints for reference, as they're typical of the checkpoints we usually see:
Jun 16 06:01:19 sql-2 postgres[19015]: [10706-1] 2014-06-16 06:01:19 UTC LOG: checkpoint starting: time
Jun 16 06:04:40 sql-2 postgres[19015]: [10707-1] 2014-06-16 06:04:40 UTC LOG: checkpoint complete: wrote 1962 buffers (0.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=197.906 s, sync=2.411 s, total=200.697 s; sync files=42, longest=1.110 s, average=0.057 s
Jun 16 06:06:19 sql-2 postgres[19015]: [10708-1] 2014-06-16 06:06:19 UTC LOG: checkpoint starting: time
Jun 16 06:12:03 sql-2 postgres[19015]: [10709-1] 2014-06-16 06:12:03 UTC LOG: checkpoint complete: wrote 25958 buffers (9.9%); 0 transaction log file(s) added, 0 removed, 12 recycled; write=269.832 s, sync=74.105 s, total=344.294 s; sync files=49, longest=64.421 s, average=1.512 s
Jun 16 06:12:03 sql-2 postgres[19015]: [10710-1] 2014-06-16 06:12:03 UTC LOG: checkpoint starting: time
Jun 16 06:16:35 sql-2 postgres[19015]: [10711-1] 2014-06-16 06:16:35 UTC LOG: checkpoint complete: wrote 2900 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=269.369 s, sync=2.357 s, total=271.820 s; sync files=47, longest=0.694 s, average=0.050 s
What I don't understand and am hoping you can help me with is why a checkpoint would balloon like the 06:06:19 one did. While we were experiencing higher load on our systems at this time, the elevated load lasted for almost an hour at a constant clip, and only this one checkpoint ballooned. As far as I can tell there wasn't anything unique or abnormal about the queries being run during this particular checkpoint.
Some unusual aspects of our PostgreSQL usage:
- we only INSERT/UPDATE, never DELETE
- we use pk+hstore tables exclusively
- we write more than we read
What can trigger this kind of out-of-character checkpoint? What should I be looking at to understand this?
Thanks!
--
Nathaniel
On Mon, Jun 16, 2014 at 11:01 AM, Nathaniel Talbott <nathaniel@spreedly.com> wrote:
> We're still learning to tune PostgreSQL for our production setup, and over
> the weekend we saw a huge spike in INSERT/UPDATE completion time for a short
> period which turned out to correlate to an oversized checkpoint.
> [...]
> Jun 16 06:12:03 sql-2 postgres[19015]: [10709-1] 2014-06-16 06:12:03 UTC
> LOG: checkpoint complete: wrote 25958 buffers (9.9%); 0 transaction log
> file(s) added, 0 removed, 12 recycled; write=269.832 s, sync=74.105 s,
> total=344.294 s; sync files=49, longest=64.421 s, average=1.512 s

So the fsync (by the kernel) of a single file took 64 seconds ("longest=64.421 s"). This single event explains almost all of the overrun and the performance problems.

This is a classic problem, but the solutions are fuzzy. Lowering /proc/sys/vm/dirty_background_ratio or /proc/sys/vm/dirty_background_bytes, and not using ext3 for the file system, are the clearest I know of, but people have also reported that lowering shared_buffers can help.

This is fundamentally a kernel/FS/IO issue. What are the current settings for those, and what kernel, file system, and IO subsystem are you using?

Cheers,
Jeff
As pointed out, you may want to look at the kernel parameters that control writeback from cache to disk.
Additionally, you may want to take a look at the background writer parameters, especially bgwriter_delay. I guess tuning those could help as well.
It might be useful if you could share your parameter file here.
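For reference, the background writer knobs live in postgresql.conf; the values below are the stock 9.x defaults, shown for orientation rather than as recommendations:

```ini
# postgresql.conf -- background writer settings (9.x defaults shown)
bgwriter_delay = 200ms           # sleep between background writer rounds
bgwriter_lru_maxpages = 100      # max buffers written per round; 0 disables
bgwriter_lru_multiplier = 2.0    # multiple of recent buffer demand to clean
```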
Best Regards,
Sameer Kumar | Database Consultant
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533
M: +65 8110 0350 T: +65 6438 3504 | www.ashnik.com
> What is the sustained volume of disk output you experience, for
> example from vmstat snapshots?
Unfortunately our metrics are... less than comprehensive at this point, so I don't have data for that at the moment. Going to work to rectify, but will take some time.
> 10% of 8GB (which sounds like the amount of RAM you have) is
> nearly a gig, which is a lot of dirty data to keep around. I'd use
> dirty_background_bytes to set it, as it gives finer control than
> dirty_background_ratio does, and set it to maybe 100MB (but depends on
> your IO system)
After realizing that often (but not always) these events seem to happen a few minutes after an autovacuum run, I'm inclined to agree. I'm trying 100MB for now; we'll see overnight whether that helps.
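For anyone following along, a sketch of the change (the sysctl name follows from the /proc path above; the two commented commands need root):

```shell
# 100MB expressed as the byte value the kernel expects:
echo $((100 * 1024 * 1024))   # 104857600

# As root, apply it now and persist it across reboots:
#   sysctl -w vm.dirty_background_bytes=104857600
#   echo 'vm.dirty_background_bytes = 104857600' >> /etc/sysctl.conf
# Note: setting dirty_background_bytes to a non-zero value zeroes
# dirty_background_ratio; only one of the pair is in effect at a time.
```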
> Series 3 kernels before 3.8 seem to have a poor (but vague) reputation
> for running database-type workloads.
Super helpful to know. Getting the OS upgraded on these boxes is on the short list anyhow, and I'll make sure we end up on at least 3.8 by the time we're done.
> How many drives of what RPM in what RAID configuration, and does the
> RAID controller have BBU. I personally don't know from drivers, but
> other people on the list do, so that info could be useful as well.
/proc/driver/cciss$ cat cciss0
cciss0: HP Smart Array P400i Controller
Board ID: X
Firmware Version: 7.22
IRQ: 64
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 191
Max # commands on controller since init: 310
Max SG entries since init: 128
Sequential access devices: 0
cciss/c0d0: 127.99GB RAID 1(1+0)
More:
~$ sudo hpacucli controller slot=0 show
Smart Array P400i in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: X
Cache Serial Number: X
RAID 6 (ADG) Status: Disabled
Controller Status: OK
Hardware Revision: D
Firmware Version: 7.22
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 100% Read / 0% Write
Drive Write Cache: Disabled
Total Cache Size: 256 MB
Total Cache Memory Available: 208 MB
No-Battery Write Cache: Disabled
Battery/Capacitor Count: 0
SATA NCQ Supported: True
Physical drives:
~$ sudo hpacucli controller slot=0 pd all show detail
Smart Array P400i in Slot 0 (Embedded)
array A
physicaldrive 1I:1:1
Port: 1I
Box: 1
Bay: 1
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 128.0 GB
Firmware Revision: DXM04B0Q
Serial Number: X
Model: ATA Samsung SSD 840
SATA NCQ Capable: True
SATA NCQ Enabled: True
Current Temperature (C): 23
Maximum Temperature (C): 70
PHY Count: 1
PHY Transfer Rate: 1.5Gbps
physicaldrive 1I:1:2
Port: 1I
Box: 1
Bay: 2
Status: OK
Drive Type: Data Drive
Interface Type: SATA
Size: 128.0 GB
Firmware Revision: DXM04B0Q
Serial Number: X
Model: ATA Samsung SSD 840
SATA NCQ Capable: True
SATA NCQ Enabled: True
Current Temperature (C): 21
Maximum Temperature (C): 70
PHY Count: 1
PHY Transfer Rate: 1.5Gbps
--
Nathaniel
On Mon, Jun 16, 2014 at 12:40 PM, Nathaniel Talbott <nathaniel@spreedly.com> wrote:
> On Mon, Jun 16, 2014 at 5:01 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> So the fsync (by the kernel) of a single file took 64 seconds. This
>> single event explains almost all of the overrun and the performance
>> problems.
>
> So the 10x number of buffers being written is a red herring and/or
> correlates with a slow fsync?

Two theories there:

Either the uniformly high level of activity is not actually uniform, and it was extra high during that checkpoint.

Or, much of the writing that happens is normally being done by the background writer rather than the checkpoint writer. Severe IO problems caused the background writer to freeze up or slow down, leaving the work to be done by the checkpoint writer instead. Once the checkpoint writer hits the fsync phase of its work, it then gets blocked by the same IO problems that were affecting the background writer. So in this theory, the shift of the workload from background writer to checkpoint writer is just a symptom of the IO problems.

Unfortunately there is little good way to assess those theories, unless you have a lot of logging data for the period of interest, like "sar" or "vmstat", or have been running "select * from pg_stat_bgwriter" periodically and saving the results.

What is the sustained volume of disk output you experience, for example from vmstat snapshots?

>> Lowering /proc/sys/vm/dirty_background_ratio and
>> /proc/sys/vm/dirty_background_bytes
>
> I've never tweaked dirty_background_* before; currently we have:
>
> $ cat /proc/sys/vm/dirty_background_ratio
> 10
> $ cat /proc/sys/vm/dirty_background_bytes
> 0
>
> Where should I start with tweaking those?

10% of 8GB (which sounds like the amount of RAM you have) is nearly a gig, which is a lot of dirty data to keep around. I'd use dirty_background_bytes to set it, as it gives finer control than dirty_background_ratio does, and set it to maybe 100MB (but depends on your IO system).

>> not using ext3 for the file system
>
> What are the recommended alternatives? We're using ext4 currently.

ext3 and ext4 are the only ones I've used personally for production databases. I think ext4 mostly fixed the problems that ext3 had. Some people like XFS or JFS, but I have no experience with them. Since you aren't using the dreaded ext3, I wouldn't worry about that aspect.

>> are the clearest I know of, but people have also reported that
>> lowering shared_buffers can help.
>
> Currently we have:
>
> => SHOW shared_buffers;
>  shared_buffers
> ----------------
>  2GB
> (1 row)
>
> That seems to match up with tuning recommendations I've read (25% of
> available RAM). And of course generally it seems to work great.

People who reported problems with large shared_buffers usually had huge amounts of RAM, so that 25% of it was far larger than 2GB. So probably nothing for you to worry about.

>> This is fundamentally a kernel/FS/IO issue. What are the current
>> settings for those, and what kernel and file system, and IO subsystem
>> are you using?
>
> Kernel: Linux 3.2.0-51-generic #77-Ubuntu SMP x86_64
> Filesystem: ext4

Series 3 kernels before 3.8 seem to have a poor (but vague) reputation for running database-type workloads.

> Not sure what you're asking RE IO subsystem - do you mean drivers/hardware?

How many drives, of what RPM, in what RAID configuration, and does the RAID controller have a BBU? I personally don't know about drivers, but other people on the list do, so that info could be useful as well.
Cheers,
Jeff
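A minimal sketch of the periodic pg_stat_bgwriter sampling suggested above (column names as of the 9.x series; the counters are cumulative, so successive samples need to be diffed to see which process did the writing):

```sql
-- Run from cron every few minutes and append the output somewhere durable.
SELECT now() AS sampled_at,
       checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,   -- pages written by the checkpointer
       buffers_clean,        -- pages written by the background writer
       buffers_backend       -- pages written directly by backends
FROM pg_stat_bgwriter;
```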