Re: 500 tpsQL + WAL log implementation - Mailing list pgsql-hackers

From Tom Lane
Subject Re: 500 tpsQL + WAL log implementation
Msg-id 19900.1037064725@sss.pgh.pa.us
In response to 500 tpsQL + WAL log implementation  ("Curtis Faith" <curtis@galtair.com>)
Responses Re: 500 tpsQL + WAL log implementation  ("Curtis Faith" <curtis@galtair.com>)
List pgsql-hackers

"Curtis Faith" <curtis@galtair.com> writes:
> Using a raw file partition and a time-based technique for determining the
> optimal write position, I am able to get 8K writes physically written to disk
> synchronously in the range of 500 to 650 writes per second using FreeBSD raw
> device partitions on IDE disks (with write cache disabled).

What can you do *without* using a raw partition?

I dislike that idea for two reasons: portability and security.  The
portability disadvantages are obvious.  And in ordinary system setups
Postgres would have to run as root in order to write on a raw partition.

It occurs to me that the same technique could be used without any raw
device access.  Preallocate a large WAL file and apply the method within
it.  You'll have more noise in the measurements due to greater
variability in the physical positioning of the blocks --- but it's
rather illusory to imagine that you know the disk geometry with any
accuracy anyway.  Modern drives play a lot of games under the hood.
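
As a rough illustration of the preallocated-file variant suggested above, the following sketch creates a fixed-size file once and then overwrites 8K blocks in place, forcing each write to disk with fsync and timing it. The function names, the block-number addressing, and the use of Python are assumptions made for illustration; this is not PostgreSQL code and not the measurement harness from the quoted message. The timings only mean something if the drive's write cache is disabled, as in the quoted setup.

```python
import os
import time

BLOCK_SIZE = 8192  # 8K writes, matching the quoted measurement


def preallocate(path, n_blocks):
    """Create a zero-filled file of n_blocks blocks and flush it to disk.

    Returns an open read/write file descriptor (hypothetical helper).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    zeros = bytes(BLOCK_SIZE)
    for _ in range(n_blocks):
        os.write(fd, zeros)
    os.fsync(fd)
    return fd


def sync_write_block(fd, block_no, payload):
    """Overwrite one block in place, wait for fsync, return elapsed seconds."""
    if len(payload) != BLOCK_SIZE:
        raise ValueError("payload must be exactly one block")
    start = time.perf_counter()
    os.lseek(fd, block_no * BLOCK_SIZE, os.SEEK_SET)
    os.write(fd, payload)
    os.fsync(fd)  # the file was preallocated, so its size does not change
    return time.perf_counter() - start
```

Calling sync_write_block repeatedly with different block numbers and comparing the elapsed times is one way to look for the position-dependent latency the technique relies on, with the extra noise mentioned above, since the filesystem decides where each block physically lives.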

> The obvious problem with the above mechanism is that the WAL log needs to be
> able to read from the log file in transaction order during recovery. This
> could be provided for using an abstraction that prepends the logical order
> for each block written to the disk and makes sure that the log blocks contain
> either a valid logical order number or some other marker indicating that the
> block is not being used.
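
The abstraction described in the quoted text can be sketched roughly as follows: each block carries a header with its logical order number, zero marks an unused block, and recovery sorts whatever valid blocks it finds. The header layout, the CRC32 check, and the function names are assumptions added for illustration; neither message specifies them.

```python
import struct
import zlib

BLOCK_SIZE = 8192
# Assumed header layout: 8-byte logical order number, 4-byte CRC32 of the payload.
HEADER = struct.Struct("<QI")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size
UNUSED = 0  # a zero-filled (preallocated) block reads as "not in use"


def make_block(seq, payload):
    """Build one on-disk block tagged with its logical order number."""
    if seq == UNUSED:
        raise ValueError("sequence 0 is reserved for unused blocks")
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit in one block")
    body = payload.ljust(PAYLOAD_SIZE, b"\0")
    return HEADER.pack(seq, zlib.crc32(body)) + body


def recover(blocks):
    """Take blocks in physical order, return (seq, payload) pairs in logical order.

    Unused blocks and blocks whose CRC does not match are skipped.
    """
    valid = []
    for raw in blocks:
        seq, crc = HEADER.unpack_from(raw)
        body = raw[HEADER.size:]
        if seq == UNUSED or zlib.crc32(body) != crc:
            continue
        valid.append((seq, body))
    valid.sort(key=lambda item: item[0])
    return valid
```

This sketch only restores ordering. It does not detect gaps in the sequence, and it does not handle several physical copies of the same logical page.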

This scares me quite a bit too.  The reason that the existing
implementation maxes out at one WAL write per rotation is that for small
transactions it's having to repeatedly write the same disk sector.  You
could only get around that by writing multiple versions of the same WAL
page at different disk locations.  Reliably reconstructing what data to
use is not something that I'm prepared to accept on a handwave...
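
The one-write-per-rotation ceiling is simple arithmetic: if every commit has to rewrite the same sector, the head must wait a full revolution between writes. A small sketch, with the spindle speeds chosen purely as examples (no drive speed is given in the thread):

```python
def rotation_time_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60000.0 / rpm


def max_same_sector_writes_per_second(rpm):
    """Upper bound on synchronous rewrites of one sector: one per rotation."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return rpm / 60.0


# Example spindle speeds (assumed, not from the thread).
for rpm in (5400, 7200):
    print(rpm, max_same_sector_writes_per_second(rpm))
# prints:
# 5400 90.0
# 7200 120.0
```

On those assumed speeds the ceiling is roughly 90 to 120 small commits per second, which is why the quoted 500 to 650 writes per second can only be reached by writing to a different disk location each time.
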
        regards, tom lane

