
From: Greg Smith
Subject: Re: Spread checkpoint sync
Msg-id: 4D34DFC3.9020802@2ndquadrant.com
In response to: Re: Spread checkpoint sync (Jim Nasby <jim@nasby.net>)
List: pgsql-hackers
Jim Nasby wrote:
> Wow, that's the kind of thing that would be incredibly difficult to
> figure out, especially while your production system is in flames...
> Can we change ereport that happens in that case from DEBUG1 to
> WARNING? Or provide some other means to track it?

That's why we already added pg_stat_bgwriter.buffers_backend_fsync: to
track the problem before trying to improve it.  Not having any
visibility into when this happened on a production server was driving
me crazy.  So far I haven't seen a need for anything beyond that.  In
the context of this new patch, for example, if you get to the point
where a backend does its own sync, you'll know it did a compaction as
part of that.  The existing statistic would tell you enough.
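
If you want to watch for this by hand in the meantime, the counter is
just a column in that view, so sampling it from the shell is a
one-liner (assuming a 9.1 build where the column exists, and default
connection settings):

# Total backend fsync calls since the last statistics reset
psql -c "SELECT buffers_backend_fsync FROM pg_stat_bgwriter;"

If that number climbs during heavy write activity, backends are getting
stuck doing their own sync calls.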

There's now enough data in test set 3 at
http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how
this breaks down on a moderately big system (well, big by most people's
standards, if not Jim's, for whom this is still a toy).  Note the
backend_sync column on the right, at the very end of the page; that's
the relevant counter I'm commenting on:

scale=175:  Some backend fsync with 64 clients, in 2 of 3 runs.
scale=250:  Significant backend fsync with 32 and 64 clients, every run.
scale=500:  Moderate to large backend fsync at any client count >=16.
This seems to be the worst spot of those mapped.  Above here, I would
guess the TPS numbers drop enough that fsync request queue activity
drops, too.
scale=1000:  Backend fsync starting at 8 clients.
scale=2000:  Backend fsync starting at 16 clients.  By here I think the
TPS volumes are getting low enough that clients are stuck waiting on
seeks significantly more often than on fsync.

Looks like the most effective spot for me to focus testing on with this
server is scales of 500 and 1000, with 16 to 64 clients.  Now that I've
got the scale range fine-tuned, I may crank up the client counts too
and see what that does.  I'm glad these are appearing in reasonable
volume here, though; I was starting to get nervous about only having
NDA-restricted results to work against.  Some days you just have to
cough up for your own hardware.

I just tagged pgbench-tools-0.6.0 and pushed it to
GitHub/git.postgresql.org with the changes that track and report on
buffers_backend_fsync, in case anyone else wants to try this out.  It
includes those numbers if you have a 9.1 that exposes them; otherwise
it just reports 0 all the time, since detecting the feature wasn't hard
to add.  The end portion of a config file for the program (the first
part specifies host/username info and the like) that would replicate
the third test set here is:

MAX_WORKERS="4"                           # pgbench worker threads (-j)
SCRIPT="tpc-b.sql"                        # standard TPC-B-like transaction script
SCALES="1 10 100 175 250 500 1000 2000"   # database scales to build and test
SETCLIENTS="4 8 16 32 64"                 # client counts tried at each scale
SETTIMES=3                                # repeats of each scale/client combination
RUNTIME=600                               # run length in seconds (pgbench -T)
TOTTRANS=""                               # transaction count limit; unused when RUNTIME is set
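
Save that onto the end of your config and kick the whole set off the
usual way from the pgbench-tools directory:

./runset

Budget some time for it: 8 scales x 5 client counts x 3 repeats at 600
seconds apiece is 120 runs, call it 20 hours of raw test time before
counting the database rebuild at each scale.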

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


