Thread: Re: Sorting writes during checkpoint
Simon Riggs <simon@2ndquadrant.com> wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

I plan to test it on RAID-5 disks, where sequential writes are much faster than random writes. I'll send the results as evidence.

I also have a related idea about sorting writes. Smoothed checkpoints in 8.3 spread out the write() calls, but issue the fsync() calls all at once. With sorted writes, we could instead call fsync() segment by segment, right after writing the dirty pages contained in each segment. That could improve worst-case response time during checkpoints.

> Note that if we do this for checkpoint we should also do this for
> FlushRelationBuffers(), used during heap_sync(), for exactly the same
> reasons.

Ah, I overlooked FlushRelationBuffers(). It is worth sorting, too.

> Would suggest calling it bulk_io_hook() or similar.

I think we need to reconsider the bufmgr - smgr - md layering, not just add an I/O elevator hook. If we are going to spread out fsync(), bufmgr needs to know where the file segments switch, which unhappily breaks the boundary between bufmgr and md in the current architecture. In addition, the current smgr layer is useless as an extension point because it cannot be extended dynamically and cannot handle multiple md-layer modules. I would rather merge the current smgr and part of bufmgr into a new smgr and add an smgr_hook() than add a bulk_io_hook().

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
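[To make the segment-by-segment fsync idea concrete: a minimal illustrative sketch, not PostgreSQL code. It assumes 1 GB segment files of 8 kB pages (131072 blocks per segment, RELSEG_SIZE in the real source) and sorts dirty block numbers so each segment gets one sequential write pass followed by a single fsync. The function name plan_checkpoint is made up for the example.]

```python
# Sketch of sorted checkpoint writes with per-segment fsync.
# PostgreSQL splits relations into 1 GB segment files; with 8 kB pages
# that is 131072 blocks per segment.
SEGMENT_BLOCKS = 131072

def plan_checkpoint(dirty_blocks):
    """Group sorted dirty block numbers by segment.

    Each (segment, blocks) pair would become one sequential write pass
    followed by a single fsync() of that segment file, instead of
    fsync'ing every segment at the end of the checkpoint.
    """
    plan = {}
    for blk in sorted(dirty_blocks):
        plan.setdefault(blk // SEGMENT_BLOCKS, []).append(blk)
    return sorted(plan.items())

# Random-order dirty pages spanning two segments:
dirty = [200000, 5, 131072, 9, 131073, 42]
for seg, blocks in plan_checkpoint(dirty):
    print(f"segment {seg}: write {blocks}, then fsync once")
```

The point of the grouping is that each fsync covers only the pages just written to that one segment file, rather than an entire checkpoint's worth of dirty data at once.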
On Mon, 7 Jul 2008, ITAGAKI Takahiro wrote:

> I plan to test it on RAID-5 disks, where sequential writes are much
> faster than random writes. I'll send the results as evidence.

If you're running more tests here, please turn on log_checkpoints and collect the logs while the test is running. I'm really curious whether there's any significant difference in what that reports in the sorted case vs. the regular one.

> Smoothed checkpoints in 8.3 spread out the write() calls, but issue
> the fsync() calls all at once. With sorted writes, we could instead
> call fsync() segment by segment, right after writing the dirty pages
> contained in each segment. That could improve worst-case response
> time during checkpoints.

Further decreasing the amount of data that is fsync'd at any point in time might be a bigger improvement than the sorting itself (so far I haven't seen anything really significant from the sort alone, but I'm still testing). One thing I didn't see any comments from you on is how, or whether, the sorted-writes patch lowers worst-case latency. That's the area I'd hope an improved fsync protocol would help most with, rather than TPS, which might even go backwards because the writes won't be as bunched and will therefore involve more seeking.

It's easy enough to analyze the data coming from "pgbench -l" to figure that out; here's an example shell snippet that shows just the worst latencies:

    pgbench -l -N <db> &
    p=$!
    wait $p
    mv pgbench_log.${p} pgbench.log
    cat pgbench.log | cut -f 3 -d " " | sort -n | tail

Actually graphing the latencies can be even more instructive; I have some examples of that on my web page you may have seen before.

> In addition, the current smgr layer is useless as an extension point
> because it cannot be extended dynamically and cannot handle multiple
> md-layer modules. I would rather merge the current smgr and part of
> bufmgr into a new smgr and add an smgr_hook() than add a bulk_io_hook().
I don't have a firm enough opinion on the code to comment on this specific suggestion, but I will say that I've found the amount of layering in this area makes it difficult to understand just what's going on sometimes (especially when new to it). A lot of that abstraction felt rather pass-through to me, and anything that collapses it a bit would help streamline the code instrumentation going on with things like DTrace.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
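[The shell pipeline above only prints the raw tail of the latencies. A small Python sketch of the same analysis, hedged: it assumes the 8.3-era "pgbench -l" line format, where the third space-separated field is the per-transaction latency in microseconds, and the function name latency_summary is made up for the example.]

```python
# Summarize worst-case latency from a "pgbench -l" transaction log.
# Assumed line format: client_id xact_no latency_us file_no epoch usec
def latency_summary(lines, worst=3):
    lats = sorted(int(line.split()[2]) for line in lines if line.strip())
    n = len(lats)
    return {
        "median_us": lats[n // 2],
        "p90_us": lats[int(n * 0.9)],
        "worst_us": lats[-worst:],  # the tail where checkpoint stalls show up
    }

# Synthetic sample with one checkpoint-like stall:
sample = [
    "0 1 1500 0 1215000000 0",
    "0 2 1800 0 1215000001 0",
    "1 1 2100 0 1215000002 0",
    "0 3 950000 0 1215000003 0",
    "1 2 1700 0 1215000004 0",
]
print(latency_summary(sample))
```

Comparing the median against the worst few values makes the checkpoint spikes obvious even when average TPS barely moves, which is exactly the worst-case-latency question raised above.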