Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Spread checkpoint sync
Msg-id 4D4C4E97.7000106@2ndquadrant.com
In response to Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
List pgsql-hackers
As already mentioned in the broader discussion at 
http://archives.postgresql.org/message-id/4D4C4610.1030109@2ndquadrant.com 
, I'm not seeing any consistent performance change from the checkpoint 
sorting code itself.  It's better sometimes and worse others, but never 
by a large amount.

Here's what the statistics part derived from the sorted data looks like 
on a real checkpoint spike:

2011-02-04 07:02:51 EST: LOG:  checkpoint starting: xlog
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in relation.segment_fork 17216.0_2
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 159 dirty blocks in relation.segment_fork 17216.0_1
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in relation.segment_fork 17216.3_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 548 dirty blocks in relation.segment_fork 17216.4_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 808 dirty blocks in relation.segment_fork 17216.5_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 799 dirty blocks in relation.segment_fork 17216.6_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 807 dirty blocks in relation.segment_fork 17216.7_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 716 dirty blocks in relation.segment_fork 17216.8_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 3857 buffers to write, 8 total dirty segment file(s) expected to need sync
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=1 file=base/16384/17216.5 time=1324.614 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=2 file=base/16384/17216.4 time=0.002 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=3 file=base/16384/17216_fsm time=0.001 msec
2011-02-04 07:03:47 EST: DEBUG:  checkpoint sync: number=4 file=base/16384/17216.10 time=16446.753 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=5 file=base/16384/17216.8 time=5804.252 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=6 file=base/16384/17216.7 time=0.001 msec
2011-02-04 07:03:54 EST: DEBUG:  compacted fsync request queue from 32768 entries to 2 entries
2011-02-04 07:03:54 EST: CONTEXT:  writing block 1642223 of relation base/16384/17216
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=7 file=base/16384/17216.11 time=6350.577 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=8 file=base/16384/17216.9 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=9 file=base/16384/17216.6 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=10 file=base/16384/17216.3 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=11 file=base/16384/17216_vm time=0.001 msec
2011-02-04 07:04:00 EST: LOG:  checkpoint complete: wrote 3813 buffers (11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11, longest=16.446 s, average=2.720 s
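
For reference, here's a minimal standalone sketch of how those per-segment 
counts can be tallied once the checkpoint's buffer tags have been sorted by 
relation, fork, and block number; the segment is just the block number 
divided by the 1GB segment size.  The struct and function names here are 
mine for illustration, not the ones in the patch:

/*
 * Sketch only:  walk buffer tags already sorted by (relfilenode, fork,
 * block) and count how many dirty blocks fall into each segment file.
 * SortedBufTag and summarize_segments are made-up names for this example.
 */
#include <stdio.h>

#define RELSEG_SIZE 131072      /* 8kB blocks per 1GB segment file */

typedef struct
{
    unsigned int relfilenode;
    int          forknum;
    unsigned int blocknum;
} SortedBufTag;

static void
summarize_segments(const SortedBufTag *tags, int ntags)
{
    int     i = 0;

    while (i < ntags)
    {
        unsigned int rel = tags[i].relfilenode;
        int          fork = tags[i].forknum;
        unsigned int seg = tags[i].blocknum / RELSEG_SIZE;
        int          count = 0;

        /* count the run of tags that fall into the same segment file */
        while (i < ntags &&
               tags[i].relfilenode == rel &&
               tags[i].forknum == fork &&
               tags[i].blocknum / RELSEG_SIZE == seg)
        {
            count++;
            i++;
        }

        printf("BufferSync %d dirty blocks in relation.segment_fork %u.%u_%d\n",
               count, rel, seg, fork);
    }
}

int
main(void)
{
    /* tiny sorted sample: two segments of relation 17216, main fork */
    SortedBufTag tags[] = {
        {17216, 0, 10}, {17216, 0, 11}, {17216, 0, 131073}, {17216, 0, 131074}
    };

    summarize_segments(tags, 4);
    return 0;
}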

You can see in the log that it ran out of fsync absorption space in the 
middle of the sync phase, which is exactly when compaction is needed 
most, and the recent patch to fix that kicked in and did its thing.
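
For anyone curious, here is a rough standalone sketch of the idea behind 
that compaction step:  when the request queue fills up, duplicate requests 
for the same segment file collapse down to one entry each.  The names here 
are made up for illustration rather than taken from Robert's patch, and a 
production version would want a hash table instead of this O(n^2) scan:

/*
 * Sketch only:  remove duplicate fsync requests in place and return the
 * new queue length.  FsyncRequest and compact_fsync_queue are illustrative
 * names, not the real ones.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct
{
    unsigned int relfilenode;
    int          forknum;
    unsigned int segno;
} FsyncRequest;

static int
compact_fsync_queue(FsyncRequest *queue, int nentries)
{
    int     keep = 0;

    for (int i = 0; i < nentries; i++)
    {
        bool    dup = false;

        /* keep only the first request seen for each segment file */
        for (int j = 0; j < keep; j++)
        {
            if (queue[j].relfilenode == queue[i].relfilenode &&
                queue[j].forknum == queue[i].forknum &&
                queue[j].segno == queue[i].segno)
            {
                dup = true;
                break;
            }
        }
        if (!dup)
            queue[keep++] = queue[i];
    }
    return keep;
}

int
main(void)
{
    FsyncRequest queue[] = {
        {17216, 0, 10}, {17216, 0, 10}, {17216, 0, 10}, {17216, 0, 11}
    };
    int     n = compact_fsync_queue(queue, 4);

    printf("compacted fsync request queue from 4 entries to %d entries\n", n);
    return 0;
}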

Couple of observations:

-The total number of buffers I'm computing based on the sorted checkpoint 
writes is not a perfect match to the number reported by the "checkpoint 
complete" status line.  In the example above, the per-segment counts add 
up to the 3857 buffers BufferSync planned to write, while the completion 
line reports 3813 written.  Sometimes the two agree, sometimes not.  Not 
sure why yet.

-The estimate for "expected to need sync" computed as a by-product of 
the checkpoint sorting is not completely accurate either.  This 
particular one has a fairly large error in it, percentage-wise, being 
off by 3 with a total of 11.  Presumably these are absorbed fsync 
requests that were already queued up before the checkpoint even 
started.  So any time estimate I drive based off of this count is only 
going to be approximate.

-The order in which the sync phase processes files is unrelated to the 
order in which they are written out.  Note that 17216.10 here, the 
biggest victim (cause?) of the I/O spike, isn't even listed among the 
checkpoint writes!

The fuzziness here is a bit disconcerting, and I'll keep digging into why 
it happens.  But I don't see any reason not to continue forward using 
the rough count here to derive a nap time from, which I can then feed 
into the "useful leftovers" patch that Robert already refactored here.  
I can always sharpen up that estimate later; next I need to get some 
solid results I can share on what the delay time does to the 
throughput/latency pattern.
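
One simple way such a nap time could be derived from the count is to give 
the sync phase some fraction of the checkpoint interval as a time budget 
and split that evenly across the files expected to need sync.  The sketch 
below just illustrates that arithmetic; the 10% budget figure and all of 
the names are placeholders for illustration, not values settled on in the 
patch:

/*
 * Sketch only:  split an assumed sync-phase time budget evenly across
 * the segment files BufferSync expects to fsync.  sync_nap_ms and the
 * 10% budget fraction are illustrative assumptions.
 */
#include <stdio.h>

static int
sync_nap_ms(double checkpoint_timeout_s, double sync_budget_fraction,
            int files_expected)
{
    double  budget_ms;

    if (files_expected <= 0)
        return 0;

    budget_ms = checkpoint_timeout_s * 1000.0 * sync_budget_fraction;
    return (int) (budget_ms / files_expected);
}

int
main(void)
{
    /* 300s checkpoint_timeout, 10% of it reserved for sync, 8 files expected */
    printf("nap between fsyncs: %d ms\n", sync_nap_ms(300.0, 0.10, 8));
    return 0;
}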

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


