WAL and commit_delay - Mailing list pgsql-hackers

From Bruce Momjian
Subject WAL and commit_delay
Date
Msg-id 200102171805.NAA24180@candle.pha.pa.us
I want to give some background on commit_delay, its initial purpose, and
possible options.

First, looking at the process that happens during a commit:
    write() - copy WAL dirty page to kernel disk buffer
    fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().
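
To make the two steps concrete, here is a minimal sketch in Python rather
than backend C. It is not the actual WAL code; commit_record() and the file
name wal_segment are made up for illustration. It only shows the order of
the calls: write() hands the bytes to the kernel, fsync() blocks until the
kernel has pushed them to the device.

```python
import os
import tempfile


def commit_record(path, record):
    """Append one record, then force it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)  # step 1: copy into the kernel buffer (usually fast)
        os.fsync(fd)          # step 2: wait for the kernel to flush to the device
    finally:
        os.close(fd)


def demo():
    # "wal_segment" is a hypothetical file name, not a real WAL path.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "wal_segment")
        commit_record(path, b"commit 1\n")
        commit_record(path, b"commit 2\n")
        with open(path, "rb") as f:
            return f.read()
```

Every call to commit_record() pays for its own fsync(), which is the cost
the rest of this message is trying to share between backends.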

What Vadim doesn't want is:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       fsync()      write()
2                    fsync()

This would be better as:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1                    write()
2       fsync()      fsync()

This was the purpose of the commit_delay.  Having two fsync()'s is not a
problem because only one will see there are dirty buffers.  The other
will probably either return right away, or wait for the other's fsync()
to complete.
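
A toy model of the two timelines, in Python, counts how many fsync() calls
actually find dirty data. The function and the tuple layout are my own
invention for illustration. It assumes an fsync() flushes everything written
in earlier ticks, and that a write() issued in the same tick as an fsync()
is not covered by it.

```python
def effective_fsyncs(timeline):
    """timeline: list of (tick, backend, op), op is "write" or "fsync".

    Returns the (tick, backend) pairs whose fsync() found dirty data.
    """
    # Assumption: within one tick, fsync() runs before write().
    order = {"fsync": 0, "write": 1}
    dirty = False
    did_work = []
    for tick, backend, op in sorted(timeline, key=lambda e: (e[0], order[e[2]])):
        if op == "write":
            dirty = True
        elif dirty:
            did_work.append((tick, backend))
            dirty = False
    return did_work


# The timeline Vadim doesn't want: each backend flushes only its own write().
UNGROUPED = [(0, 1, "write"), (1, 1, "fsync"), (1, 2, "write"), (2, 2, "fsync")]

# The better timeline: both write()'s land before either fsync().
GROUPED = [(0, 1, "write"), (1, 2, "write"), (2, 1, "fsync"), (2, 2, "fsync")]
```

In this model effective_fsyncs(UNGROUPED) reports two flushes that do work,
and effective_fsyncs(GROUPED) reports one.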

With the delay, it looks like:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       sleep()      write()
2       fsync()      sleep()
3                    fsync()

Which shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time.  However, a very possible
circumstance is:

time    backend 1    backend 2    backend 3
----    ---------    ---------    ---------
0       write()
1       sleep()      write()
2       fsync()      sleep()      write()
3                    fsync()      sleep()
4                                 fsync()

In this case, the fsync() by backend 2 does indeed do some work because
it fsyncs backend 3's write().  Frankly, I don't see how the sleep does
much except delay things, because it has no smarts about when the delay
is useful and when it is useless.  Without that feedback, I recommend
removing the entire setting.  For single backends, the sleep is clearly
a loser.
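
The same kind of tick model can be run with and without the sleep. This is
only a sketch under loud assumptions: one tick per step, a sleep of exactly
`delay` ticks, fsync() treated as instantaneous, and a write() in the same
tick as an fsync() not covered by it. simulate() is a made-up name, not
anything in the backend.

```python
def simulate(arrivals, delay):
    """arrivals: the tick at which each backend issues its write().

    Each backend sleeps `delay` ticks after write(), then calls fsync().
    Returns (fsyncs that found dirty data, total ticks spent between
    write() and fsync() summed over all backends).
    """
    events = []
    for write_tick in arrivals:
        events.append((write_tick, 1, "write"))
        # Priority 0 sorts fsync() ahead of a write() in the same tick.
        events.append((write_tick + 1 + delay, 0, "fsync"))
    events.sort()

    dirty = False
    working_fsyncs = 0
    for _tick, _priority, op in events:
        if op == "write":
            dirty = True
        elif dirty:
            working_fsyncs += 1
            dirty = False
    ticks_waited = len(arrivals) * (1 + delay)
    return working_fsyncs, ticks_waited
```

With these assumptions, simulate([0], 0) and simulate([0], 1) both do one
working fsync(), but the second waits twice as long. simulate([0, 1, 2], 1)
still ends up with two working fsync()'s out of three, because the sleep
has no idea that a third write() is about to arrive.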

Another situation it cannot deal with is:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       sleep()
2       fsync()      write()
3                    sleep()
4                    fsync()

My solution can't deal with this either.

---------------------------------------------------------------------------

The quick fix is to remove the commit_delay code.  A more elaborate
performance boost would be to have each backend get feedback from the
other backends, so it can block and wait for other about-to-fsync
backends before calling fsync().  This allows the write()'s to bunch up
before the fsync().

Here is the single backend case, which experiences no delays:

time    backend 1    backend 2
----    ---------    ---------
0       get_shlock()
1       write()
2       rel_shlock()
3       get_exlock()
4       rel_exlock()
5       fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

time    backend 1    backend 2
----    ---------    ---------
0       get_shlock()
1       write()
2       rel_shlock() get_shlock()
3       get_exlock() write()
4                    rel_shlock()
5       rel_exlock()
6       fsync()      get_exlock()
7                    rel_exlock()
8                    fsync()
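
Here is a rough sketch of that protocol using Python threads. It is not
backend code: SharedExclusiveLock is a hand-rolled stand-in for our lock
manager, the names just mirror the timeline, and write()/fsync() are
replaced by log entries. To keep the run repeatable, backend 2 takes its
shared lock before backend 1 starts, which differs slightly from the
timeline above but exercises the same wait.

```python
import threading


class SharedExclusiveLock:
    """Hypothetical shared/exclusive lock built on a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0
        self._exclusive = False

    def get_shlock(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._sharers += 1

    def rel_shlock(self):
        with self._cond:
            self._sharers -= 1
            self._cond.notify_all()

    def get_exlock(self):
        with self._cond:
            # Waits until no backend is still in the middle of its write().
            while self._exclusive or self._sharers:
                self._cond.wait()
            self._exclusive = True

    def rel_exlock(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def two_backend_demo():
    lock = SharedExclusiveLock()
    log = []
    b1_written = threading.Event()

    def backend_1():
        lock.get_shlock()
        log.append("b1 write")
        lock.rel_shlock()
        b1_written.set()
        lock.get_exlock()   # blocks while backend 2 holds the shared lock
        lock.rel_exlock()
        log.append("b1 fsync")

    lock.get_shlock()       # backend 2 (this thread) is about to write
    worker = threading.Thread(target=backend_1)
    worker.start()
    b1_written.wait()
    log.append("b2 write")
    lock.rel_shlock()       # backend 1 may now pass get_exlock()
    worker.join()
    lock.get_exlock()
    lock.rel_exlock()
    log.append("b2 fsync")
    return log
```

The exclusive lock is taken and dropped right away. Its only job is to make
a backend wait for every write() already in progress, so that its fsync()
covers them too.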

Contrast that with the first two-backend case presented above:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       fsync()      write()
2                    fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

