Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: CA+CSw_ta4s_gLGnAH4GQyvLdc8LAWgFJXhS7Y-Lwt5tZOHz_9g@mail.gmail.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Greg Smith <greg@2ndQuadrant.com>)
Responses:
  Re: Improvement of checkpoint IO scheduler for stable transaction responses
  Re: Improvement of checkpoint IO scheduler for stable transaction responses
List: pgsql-hackers
On Jul 14, 2013 9:46 PM, "Greg Smith" <greg@2ndquadrant.com> wrote:
> I updated and re-reviewed that in 2011:
> http://www.postgresql.org/message-id/4D31AE64.3000202@2ndquadrant.com
> and commented on why I think the improvement was difficult to reproduce
> back then. The improvement didn't follow for me either. It would take a
> really amazing bit of data to get me to believe write sorting code is
> worthwhile after that. On large systems capable of dirtying enough blocks
> to cause a problem, the operating system and RAID controllers are already
> sorting blocks. And *that* sorting is also considering concurrent read
> requests, which are a lot more important to an efficient schedule than
> anything the checkpoint process knows about. The database doesn't have
> nearly enough information yet to compete against OS level sorting.

That reasoning makes no sense. OS-level sorting can only see writes in the
time window between PostgreSQL issuing them and their being forced to disk.
Spread checkpoints sprinkle the writes out over a long period, and the
general tuning advice is to tightly bound the amount of memory the OS is
willing to keep dirty. This makes the probability of scheduling adjacent
writes together quite low, the merging window being limited by either
dirty_bytes or dirty_expire_centisecs. The checkpointer has the best
long-term overview of the situation here; OS scheduling has only the
short-term view of outstanding read and write requests. By sorting
checkpoint writes, it becomes much more likely that adjacent blocks are
visible to OS writeback at the same time and will be issued together.

I gave the linked patch a shot. I tried it with pgbench at scale factor 100
and 32 clients, with shared_buffers=3GB, checkpoint_timeout=5min,
checkpoint_segments=100, checkpoint_completion_target=0.5. pgdata was on a
7200 RPM HDD, xlog on an Intel 320 SSD; kernel settings:
dirty_background_bytes = 32M, dirty_bytes = 128M.
First checkpoint on master:

  wrote 209496 buffers (53.7%); 0 transaction log file(s) added, 0 removed,
  26 recycled; write=314.444 s, sync=9.614 s, total=324.166 s;
  sync files=16, longest=9.208 s, average=0.600 s

IO while checkpointing: about 500 write iops at 5MB/s, 100% utilisation.

First checkpoint with checkpoint sorting applied:

  wrote 205269 buffers (52.6%); 0 transaction log file(s) added, 0 removed,
  0 recycled; write=149.049 s, sync=0.386 s, total=149.559 s;
  sync files=39, longest=0.255 s, average=0.009 s

IO while checkpointing: about 23 write iops at 12MB/s, 10% utilisation.

Transaction processing rate for a 20 min run went from 5200 to 7000. It
looks to me that in this admittedly best-case workload the sorting is
working exactly as designed, converting mostly random IO into sequential
IO. I have seen many real-world workloads where this kind of sorting would
have benefited greatly.

I also did an I/O bound test with scale factor 100 and checkpoint_timeout
30min. The 2-hour average tps went from 121 to 135, but I'm not yet sure
whether that is repeatable or just noise.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de