
From Greg Smith
Subject Re: Spread checkpoint sync
Date
Msg-id 4D357733.3090104@2ndquadrant.com
In response to Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas wrote:
> Idea #4: For ext3 filesystems that like to dump the entire buffer
> cache instead of only the requested file, write a little daemon that
> runs alongside of (and completely independently of) PostgreSQL.  Every
> 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
> closes the file, thus dumping the cache and preventing a ridiculous
> growth in the amount of data to be sync'd at checkpoint time.
>   
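
For reference, here's a minimal sketch of what such a daemon could look 
like; the file path and the 30 second interval are just placeholders, and 
this isn't code from anywhere in particular:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/var/tmp/flush-nudge";  /* placeholder path */
    char        byte = 0;

    for (;;)
    {
        int         fd = open(path, O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
        {
            perror("open");
            exit(1);
        }
        byte ^= 1;              /* make sure there's a dirty byte to write */
        if (write(fd, &byte, 1) != 1)
            perror("write");
        if (fsync(fd) != 0)     /* on ext3 this flushes far more than 1 byte */
            perror("fsync");
        close(fd);
        sleep(30);
    }
}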

Today's data suggests this problem has been resolved in the latest 
kernels.  I saw the "one giant flush instead of a series of small flushes" 
pattern quite easily on the CentOS5 system I last did heavy pgbench 
testing on.  The one I'm testing now has kernel 2.6.32 (Ubuntu 10.04), and 
it doesn't show that behavior at all.

Here's what a bad checkpoint looks like on this system:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG:  checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG:  checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG:  checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG:  checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG:  checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG:  checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG:  checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG:  checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG:  checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG:  checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG:  checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG:  checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG:  checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG:  checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG:  checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG:  checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG:  checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG:  checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG:  checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec
DEBUG:  checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG:  checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1057833 of relation base/24746/36596

[>800 more of these]

DEBUG:  checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec
DEBUG:  could not forward fsync request because request queue is full
DEBUG:  checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG:  checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG:  checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 20619 of relation base/24746/36601
DEBUG:  checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG:  checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 278967 of relation base/24746/36596
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1582400 of relation base/24746/36596
DEBUG:  checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG:  checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG:  checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG:  checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG:  checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG:  checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG:  checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG:  checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG:  checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s

Here the file that's been brutally delayed by backend contention is #23, 
coming after quite long delays on several of the earlier files already; 
the sync phase alone took 111.6 seconds of the 144 second total.  I never 
saw that pattern with earlier kernels running ext3.

This is good, in that it makes it more likely that a spread sync approach 
that works on XFS will also work on these newer kernels with ext4.  If 
that pans out, the only group we wouldn't be able to help is the ext3 + 
old kernel crowd.
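
Just to spell out what I mean by a spread sync approach: the basic idea is 
to pace the fsync calls during the sync phase rather than issue them back 
to back, so each file's writeback can drain before the next one is pushed 
out.  Something along these lines, heavily simplified compared to what the 
actual patch has to deal with:

#include <stdio.h>
#include <unistd.h>

/*
 * Illustration only: fsync each pending file, sleeping between calls so
 * the writes from one sync can complete before the next file is synced.
 */
static void
spread_sync(const int fds[], int nfiles, unsigned int delay_ms)
{
    int     i;

    for (i = 0; i < nfiles; i++)
    {
        if (fsync(fds[i]) != 0)
            perror("fsync");
        if (i < nfiles - 1)
            usleep(delay_ms * 1000U);   /* the "spread" part */
    }
}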

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


