Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers
From: Hannu Krosing
Subject: Re: Analysis of ganged WAL writes
Msg-id: 1034015575.2562.29.camel@rh72.home.ee
In response to: Re: Analysis of ganged WAL writes ("Curtis Faith" <curtis@galtair.com>)
Responses: Re: Analysis of ganged WAL writes
List: pgsql-hackers
On Tue, 2002-10-08 at 00:12, Curtis Faith wrote:
> Tom, first of all, excellent job improving the current algorithm. I'm
> glad you looked at the WALCommitLock code.
>
> > This must be so because the backends that are
> > released at the end of any given disk revolution will not be able to
> > participate in the next group commit, if there is already at least
> > one backend ready to commit.
>
> This is the major reason for my original suggestion about using
> aio_write. The writes don't block each other and there is no need for
> a kernel-level exclusive locking call like fsync or fdatasync.
>
> Even the theoretical limit you mention of one transaction per
> revolution per committing process seems like a significant bottleneck.
>
> Is committing 1 to 4 transactions on every revolution good? It's
> certainly better than 1 per revolution.

Of course committing all 5 at each rev would be better ;)

> However, what if we could have done 3 transactions per process in the
> time it took for a single revolution?

I may be missing something obvious, but I don't see a way to get more
than 1 trx/process/revolution, as each previous transaction in that
process must be written to disk before the next can start, and the only
way it can be written to the disk is when the disk heads are in the
right place, which happens exactly once per revolution.

In theory we could devise some clever page interleave scheme that would
allow us to go like this: fill one page - write page to disk, commit
trx's - fill the next page in the next 1/3 of a rev - write the next
page to disk - ..., but this will work only for some limited set of WAL
page sizes.

It could be possible to get near 5 trx/rev for 5 backends if we do the
following (A-E are backends from Tom's explanation):

1. write the page for A's trx to its proper position P (where P is the
   page number)

2.
if, after the sync for A returns, we already have more transactions
waiting for write()+sync() of the same page, immediately write the
_same_ page to position P+N (where N is a tunable parameter). If N is
small enough, then P+N will be on the same cylinder in most cases, and
thus transactions B-E will also get committed on the same rev.

3. make sure that the last version will also be written to its proper
   place before the end of log overwrites P+N. (This may be tricky.)

4. when restoring from WAL, always check for a page at EndPos+N for a
   possible newer version of the last page.

This scheme requires page numbers + page versions to be stored in each
page and could get us near 1 trx/backend/rev performance, but it's hard
to tell if it is really useful in real life.

This could also possibly be extended to more than one "end page" and
more than one "continuation end page copy" to get better than
1 trx/backend/rev.

-----------------
Hannu
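To put the once-per-revolution limit in numbers: assuming a 10,000 RPM disk (a figure not stated in the thread, chosen only for illustration), the per-backend commit ceiling works out as:

```python
# Back-of-the-envelope numbers for the 1 trx/backend/revolution limit.
# The 10,000 RPM figure is an assumption for illustration only.
rpm = 10_000
revs_per_sec = rpm / 60.0          # ~166.7 revolutions per second

one_backend = revs_per_sec         # 1 commit per rev per backend
five_ganged = 5 * revs_per_sec     # 5 backends ganged into each rev

print(round(one_backend), round(five_ganged))  # 167 833
```

So a lone backend tops out around 167 commits/sec regardless of CPU speed, which is why ganging B-E onto A's revolution matters so much.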
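The duplicate end-page scheme above can be sketched in miniature. This is only an illustration of steps 2 and 4, not anything from PostgreSQL itself: the names (N, write_end_page, recover_end_page) are invented, and the "disk" is just a dict keyed by page number. Pages carry the page number and a version, as the scheme requires, and recovery prefers the copy at EndPos+N when its version is newer.

```python
# Toy model of the "write the same page again at P+N" idea.
# All names here are invented for illustration; pages carry
# (page_no, version) as the scheme requires.

N = 4  # tunable offset; small so P+N usually stays on the same cylinder

def write_end_page(disk, page_no, version, payload):
    """Step 2: write the end-of-log page to its proper slot and
    mirror it at page_no + N."""
    page = {"page_no": page_no, "version": version, "payload": payload}
    disk[page_no] = dict(page)
    disk[page_no + N] = dict(page)

def recover_end_page(disk, end_pos):
    """Step 4: on restore, also look at end_pos + N and take the
    copy with the higher version, if it really is a copy of end_pos."""
    primary = disk.get(end_pos)
    shadow = disk.get(end_pos + N)
    if shadow is not None and shadow["page_no"] == end_pos:
        if primary is None or shadow["version"] > primary["version"]:
            return shadow
    return primary

disk = {}
write_end_page(disk, 10, 1, "trx A")
# Simulate a crash where only the P+N copy of the newer version made it:
disk[10 + N] = {"page_no": 10, "version": 2, "payload": "trx A..E"}
print(recover_end_page(disk, 10)["version"])  # 2
```

The `page_no` check in recovery is what step 3 worries about: once the log grows past P+N, that slot holds a genuinely different page and must not be mistaken for a newer copy of the old end page.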