Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Analysis of ganged WAL writes
Date
Msg-id 1034015575.2562.29.camel@rh72.home.ee
In response to Re: Analysis of ganged WAL writes  ("Curtis Faith" <curtis@galtair.com>)
Responses Re: Analysis of ganged WAL writes
List pgsql-hackers
On Tue, 2002-10-08 at 00:12, Curtis Faith wrote:
> Tom, first of all, excellent job improving the current algorithm. I'm glad
> you looked at the WALCommitLock code.
> 
> > This must be so because the backends that are
> > released at the end of any given disk revolution will not be able to
> > participate in the next group commit, if there is already at least
> > one backend ready to commit.
> 
> This is the major reason for my original suggestion about using aio_write.
> The writes don't block each other and there is no need for a kernel level
> exclusive locking call like fsync or fdatasync.
> 
> Even the theoretical limit you mention of one transaction per revolution
> per committing process seems like a significant bottleneck.
> 
> Is committing 1 and 4 transactions on every revolution good? It's certainly
> better than 1 per revolution.

Of course committing all 5 at each rev would be better ;)
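
(Purely for concreteness, the aio_write() path Curtis describes above
would look roughly like the sketch below: plain POSIX AIO, with the log
file assumed to be opened O_DSYNC so a completed write is also on disk.
None of these names come from the actual patch or from xlog.c.)

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* queue one WAL page for writing without blocking other backends */
static int
queue_wal_write(int fd, char *page, size_t len, off_t offset,
                struct aiocb *cb)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = page;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    return aio_write(cb);       /* returns at once, no exclusive fsync call */
}

/* later, before reporting the commit, wait for just this one write */
static ssize_t
wait_wal_write(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    return aio_return(cb);
}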

> However, what if we could have done 3 transactions per process in the time
> it took for a single revolution?

I may be missing something obvious, but I don't see a way to get more
than 1 trx/process/revolution, as each previous transaction in that
process must be written to disk before the next can start, and the only
way it can be written to disk is when the disk head is in the right
place, which happens exactly once per revolution.
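
For example, on a 10,000 RPM disk (just an illustrative number, not a
figure from this thread) that is 10000/60 ~= 167 revolutions per
second, so a single backend committing serially can never do more than
about 167 trx/sec, no matter how fast the rest of the system is.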

In theory we could devise some clever page interleave scheme that would
allow us to go like this: fill one page, write it to disk and commit its
trx's, fill the next page during the next 1/3 of a revolution, write
that page to disk, and so on. But this will work only for a limited set
of WAL page sizes.
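
(For example, if a track held exactly three WAL pages, the cycle would
be: write page k, spend the next 1/3 of a revolution filling page k+1,
write it when its slot comes under the head, and so on. As soon as the
page size doesn't divide the track evenly the timing drifts and the
trick breaks; the three-pages-per-track figure is only for
illustration.)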

It could be possible to get near 5 trx/rev for 5 backends if we do the
following (A-E are the backends from Tom's explanation; a rough sketch
in code follows the list):

1. Write the page for A's trx to its proper position P (where P is the
page number).

2. If, after the sync for A returns, we already have more transactions
waiting for a write()+sync() of the same page, immediately write the
_same_ page to position P+N (where N is a tunable parameter). If N is
small enough, P+N will in most cases be on the same cylinder, so
transactions B-E also get committed on the same rev.

3. Make sure that the last version is also written to its proper place
before the end of the log overwrites P+N. (This may be tricky.)

4. When restoring from WAL, always check position EndPos+N for a
possible newer version of the last page.
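
In code, the write and recovery sides of 1-4 could look something like
this. Every name, the page header layout and the value of N here are
made up just to make the idea concrete; this is not how xlog.c is
actually structured:

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define WAL_PAGE_SIZE 8192
#define DUP_OFFSET_N  8       /* N: tunable distance between P and its copy */

typedef struct
{
    uint32_t pageno;          /* proper position P of this page */
    uint32_t version;         /* bumped each time the page is rewritten */
} WalPageHeader;

/*
 * Steps 1-2: write the current end page to its proper position P and,
 * if more transactions are already waiting on the same page, also to
 * P+N.  With a small enough N both copies should usually land on the
 * same cylinder, so B-E get committed on the same revolution as A.
 */
static int
flush_end_page(int fd, char *page, uint32_t pageno, uint32_t version,
               int more_waiters)
{
    WalPageHeader *hdr = (WalPageHeader *) page;

    hdr->pageno = pageno;
    hdr->version = version;

    if (pwrite(fd, page, WAL_PAGE_SIZE,
               (off_t) pageno * WAL_PAGE_SIZE) != WAL_PAGE_SIZE)
        return -1;
    if (more_waiters &&
        pwrite(fd, page, WAL_PAGE_SIZE,
               (off_t) (pageno + DUP_OFFSET_N) * WAL_PAGE_SIZE) != WAL_PAGE_SIZE)
        return -1;
    return fdatasync(fd);     /* step 3's final rewrite to P is not shown */
}

/*
 * Step 4: at recovery, after reading the apparent last page at position
 * endpos, also look at endpos+N and keep whichever copy claims to be
 * the newer version of that page.
 */
static void
read_last_page(int fd, char *out, uint32_t endpos)
{
    char a[WAL_PAGE_SIZE], b[WAL_PAGE_SIZE];
    WalPageHeader *ha = (WalPageHeader *) a;
    WalPageHeader *hb = (WalPageHeader *) b;
    ssize_t got;

    pread(fd, a, WAL_PAGE_SIZE, (off_t) endpos * WAL_PAGE_SIZE);
    got = pread(fd, b, WAL_PAGE_SIZE,
                (off_t) (endpos + DUP_OFFSET_N) * WAL_PAGE_SIZE);

    if (got == WAL_PAGE_SIZE && hb->pageno == endpos &&
        hb->version > ha->version)
        memcpy(out, b, WAL_PAGE_SIZE);
    else
        memcpy(out, a, WAL_PAGE_SIZE);
}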

This scheme requires page numbers and page versions to be stored in
each page and could get us near 1 trx/backend/rev, but it's hard to
tell whether it would really be useful in real life.

This could also possibly be extended to more than one "end page" and
more than one "continuation end page copy" to get better than 1
trx/backend/rev.

-----------------
Hannu



