Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date
Msg-id CA+CSw_ta4s_gLGnAH4GQyvLdc8LAWgFJXhS7Y-Lwt5tZOHz_9g@mail.gmail.com
In response to Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Improvement of checkpoint IO scheduler for stable transaction responses
List pgsql-hackers
On Jul 14, 2013 9:46 PM, "Greg Smith" <greg@2ndquadrant.com> wrote:
> I updated and re-reviewed that in 2011: http://www.postgresql.org/message-id/4D31AE64.3000202@2ndquadrant.com and
> commented on why I think the improvement was difficult to reproduce back then.  The improvement didn't follow for me
> either. It would take a really amazing bit of data to get me to believe write sorting code is worthwhile after that.
> On large systems capable of dirtying enough blocks to cause a problem, the operating system and RAID controllers are
> already sorting block.  And *that* sorting is also considering concurrent read requests, which are a lot more important
> to an efficient schedule than anything the checkpoint process knows about.  The database doesn't have nearly enough
> information yet to compete against OS level sorting.

That reasoning makes no sense. OS level sorting can only see the
writes in the time window between PostgreSQL issuing the write and
the data being forced to disk. Spread checkpoints sprinkle the writes
out over a long period, and the general tuning advice is to heavily
bound the amount of memory the OS is willing to keep dirty. This makes
the probability of scheduling adjacent writes together quite low, the
merging window being limited by either dirty_bytes or
dirty_expire_centisecs. The checkpointer has the best long term
overview of the situation here; OS scheduling only has the short term
view of outstanding read and write requests. By sorting checkpoint
writes it is much more likely that adjacent blocks are visible to OS
writeback at the same time and will be issued together.
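
To make the merging-window argument concrete, here is a toy model in
Python. It is only a sketch under stated assumptions, not the patch's
code and not how the kernel actually schedules writeback: the
function name count_write_requests, the fixed-size "window" of writes
the OS can see at once, and the block counts are all made up for
illustration.

```python
import random

def count_write_requests(order, window):
    """Count I/O requests when only blocks in the same window can merge.

    `order` is the sequence in which blocks are handed to the OS.
    `window` (hypothetical) is how many consecutive writes the OS sees
    before it has to flush them. Inside one window, a run of consecutive
    block numbers is issued as a single request.
    """
    requests = 0
    for start in range(0, len(order), window):
        batch = sorted(order[start:start + window])
        # One request for the first block, plus one per gap in the batch.
        requests += 1 + sum(1 for a, b in zip(batch, batch[1:]) if b != a + 1)
    return requests

rng = random.Random(0)
total_blocks = 100_000
# About half of the blocks are dirty, visited in random (buffer pool) order.
dirty = rng.sample(range(total_blocks), 50_000)
window = 1_000

unsorted_requests = count_write_requests(dirty, window)
sorted_requests = count_write_requests(sorted(dirty), window)
print("unsorted checkpoint:", unsorted_requests, "requests")
print("sorted checkpoint:  ", sorted_requests, "requests")
```

In this model the OS still sorts within its window in both cases. The
only difference is whether neighbouring blocks arrive close enough in
time to land in the same window, which is the point being argued
above.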

I gave the linked patch a shot. I tried it with pgbench at scale 100
and concurrency 32, PostgreSQL with shared_buffers=3GB,
checkpoint_timeout=5min, checkpoint_segments=100,
checkpoint_completion_target=0.5, pgdata on a 7200RPM HDD, xlog on an
Intel 320 SSD, and kernel settings dirty_background_bytes = 32M,
dirty_bytes = 128M.

first checkpoint on master: wrote 209496 buffers (53.7%); 0
transaction log file(s) added, 0 removed, 26 recycled; write=314.444
s, sync=9.614 s, total=324.166 s; sync files=16, longest=9.208 s,
average=0.600 s
IO while checkpointing: about 500 write iops at 5MB/s, 100% utilisation.

first checkpoint with checkpoint sorting applied: wrote 205269 buffers
(52.6%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=149.049 s, sync=0.386 s, total=149.559 s; sync files=39,
longest=0.255 s, average=0.009 s
IO while checkpointing: about 23 write iops at 12MB/s, 10% utilisation.

Transaction processing rate for a 20min run went from 5200 to 7000.

It looks to me that in this admittedly best case workload the sorting
is working exactly as designed, converting mostly random IO into
sequential. I have seen many real world workloads that would have
benefited greatly from this kind of sorting.
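
The random-to-sequential claim can be checked with some quick
arithmetic on the iostat figures reported above. This is a rough
sketch, assuming that "MB/s" means MiB/s and that the build uses
PostgreSQL's default 8192-byte block size; the helper name
avg_request_bytes is just for illustration.

```python
PAGE = 8192          # PostgreSQL default block size in bytes (assumed here)
MIB = 1024 * 1024    # assumption: the reported "MB/s" is MiB/s

def avg_request_bytes(mb_per_s, iops):
    """Average bytes per write request, from throughput and request rate."""
    return mb_per_s * MIB / iops

master = avg_request_bytes(5, 500)       # master: ~500 iops at 5MB/s
with_sorting = avg_request_bytes(12, 23)  # sorted: ~23 iops at 12MB/s

print(f"master:  {master / PAGE:.1f} pages per write request")
print(f"sorting: {with_sorting / PAGE:.1f} pages per write request")
print(f"tps change: {7000 / 5200 - 1:.0%}")
```

On master each request carries little more than a single page, which
is what random IO looks like, while with sorting each request carries
several dozen pages.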

I also did an I/O-bound test with scale factor 100 and
checkpoint_timeout 30min. The 2-hour average tps went from 121 to 135,
but I'm not yet sure if it's repeatable or just noise.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
