WAL and commit_delay - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | WAL and commit_delay |
Msg-id | 200102171805.NAA24180@candle.pha.pa.us |
Responses | Re: WAL and commit_delay |
List | pgsql-hackers |
I want to give some background on commit_delay, its initial purpose, and
possible options.

First, looking at the process that happens during a commit:

    write() - copy WAL dirty page to kernel disk buffer
    fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().

What Vadim doesn't want is:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       write()
    1       fsync()         write()
    2                       fsync()

This would be better as:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       write()
    1                       write()
    2       fsync()         fsync()

This was the purpose of the commit_delay. Having two fsync()'s is not a
problem because only one will see there are dirty buffers. The other
will probably either return right away, or wait for the other's fsync()
to complete.

With the delay, it looks like:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       write()
    1       sleep()         write()
    2       fsync()         sleep()
    3                       fsync()

Which shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time. However, a very possible
circumstance is:

    time    backend 1       backend 2       backend 3
    ----    ---------       ---------       ---------
    0       write()
    1       sleep()         write()
    2       fsync()         sleep()         write()
    3                       fsync()         sleep()
    4                                       fsync()

In this case, the fsync() by backend 2 does indeed do some work because
it fsyncs backend 3's write(). Frankly, I don't see how the sleep does
much except delay things, because it doesn't have any smarts about when
the delay is useful and when it is useless. Without that feedback, I
recommend removing the entire setting. For single backends, the sleep
is clearly a loser.

Another situation it cannot deal with is:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       write()
    1       sleep()
    2       fsync()         write()
    3                       sleep()
    4                       fsync()

My solution can't deal with this either.

---------------------------------------------------------------------------

The quick fix is to remove the commit_delay code.
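As a rough check on the timelines above, here is a small Python model. It is not PostgreSQL code, only a sketch under stated assumptions: an fsync() flushes every WAL write() issued in an earlier time step, an fsync() does not cover a write() that starts in the same step, and sleep() does nothing but pass time. The function name `working_fsyncs` and the event tuples are made up for this illustration.

```python
from collections import defaultdict


def working_fsyncs(events):
    """Return the backends whose fsync() found dirty WAL data to flush.

    events: iterable of (time, backend, op), op in {"write", "sleep", "fsync"}.
    """
    by_time = defaultdict(list)
    for time, backend, op in events:
        by_time[time].append((backend, op))

    dirty = False   # is there unflushed WAL data in the kernel buffer?
    workers = []    # backends whose fsync() actually had something to flush
    for time in sorted(by_time):
        step = by_time[time]
        # Assumption: an fsync() only covers write()s from earlier steps,
        # so fsync()s in a step are handled before that step's write()s.
        for backend, op in step:
            if op == "fsync" and dirty:
                workers.append(backend)
                dirty = False
        for backend, op in step:
            if op == "write":
                dirty = True
        # "sleep" entries change nothing; they only occupy a time step.
    return workers


# The two-backend case without commit_delay.
no_delay = [(0, 1, "write"), (1, 1, "fsync"), (1, 2, "write"), (2, 2, "fsync")]

# The two-backend case with commit_delay.
with_delay = [(0, 1, "write"), (1, 1, "sleep"), (1, 2, "write"),
              (2, 1, "fsync"), (2, 2, "sleep"), (3, 2, "fsync")]

# The three-backend case with commit_delay.
three_backends = [(0, 1, "write"), (1, 1, "sleep"), (1, 2, "write"),
                  (2, 1, "fsync"), (2, 2, "sleep"), (2, 3, "write"),
                  (3, 2, "fsync"), (3, 3, "sleep"), (4, 3, "fsync")]
```

Under these assumptions, `no_delay` reports two working fsync()'s (backends 1 and 2), `with_delay` reports one (backend 1), and `three_backends` reports two again (backends 1 and 2), which is the case where the delay bought nothing for backend 2.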
A more elaborate performance boost would be to have each backend get
feedback from other backends, so they can block and wait for other
about-to-fsync backends before fsync(). This allows the write()'s to
bunch up before the fsync().

Here is the single backend case, which experiences no delays:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       get_shlock()
    1       write()
    2       rel_shlock()
    3       get_exlock()
    4       rel_exlock()
    5       fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       get_shlock()
    1       write()
    2       rel_shlock()    get_shlock()
    3       get_exlock()    write()
    4                       rel_shlock()
    5       rel_exlock()
    6       fsync()         get_exlock()
    7                       rel_exlock()
    8                       fsync()

Contrast that with the first two-backend case presented above:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       write()
    1       fsync()         write()
    2                       fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
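The get_shlock()/get_exlock() sequence from the timelines above can be sketched in Python as follows. This is only an illustration of the ordering, not PostgreSQL's lock manager: the class `SharedExclusiveLock`, the function `commit`, and the use of threads in place of backends are all assumptions made for the example. Only `os.write()`, `os.fsync()`, and `threading.Condition` are real library calls.

```python
import os
import threading


class SharedExclusiveLock:
    """Hypothetical stand-in for the shared/exclusive lock in the timelines."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of current shared holders
        self._exclusive = False   # True while someone holds the exclusive lock

    def get_shlock(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def rel_shlock(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def get_exlock(self):
        with self._cond:
            # Blocks until no backend is still inside its write().
            while self._exclusive or self._shared > 0:
                self._cond.wait()
            self._exclusive = True

    def rel_exlock(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def commit(fd, record, lock):
    """One backend's commit: write under a shared lock, then wait out the
    other writers via the exclusive lock, then fsync()."""
    lock.get_shlock()
    try:
        os.write(fd, record)      # copy WAL data to the kernel buffer
    finally:
        lock.rel_shlock()
    lock.get_exlock()             # returns once all in-progress write()s finish
    lock.rel_exlock()
    os.fsync(fd)                  # force the kernel buffer to disk
```

A single caller never waits in `get_exlock()`, matching the single backend case. With several concurrent callers, a backend that has finished its own write() waits for any backend still holding the shared lock, so those write()'s land before its fsync().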