Thread: 500 tpsQL + WAL log implementation

500 tpsQL + WAL log implementation

From: "Curtis Faith"
I have been experimenting with empirical tests of file system and device
level writes to determine the actual constraints in order to speed up the WAL
logging code.

Using a raw file partition and a time-based technique for determining the
optimal write position, I am able to get 8K writes physically written to disk
synchronously in the range of 500 to 650 writes per second using FreeBSD raw
device partitions on IDE disks (with write cache disabled).  I will be
testing it soon under Linux with 10,000 RPM SCSI disks, which should be even
better.
It is my belief that the mechanism used to achieve these speeds could be
incorporated into the existing WAL logging code as an abstraction that looks
to the WAL code just like the file-level access currently used. The existing
WAL code's speed is limited by the rate of a single disk rotation: for a
7,200 RPM disk that is 120 writes/second; for a 10,000 RPM disk,
166.66 writes/second.

The mechanism works by adjusting the seek offset of the write by using
gettimeofday to determine approximately where the disk head is in its
rotation. The mechanism does not use any AIO calls.
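
In rough outline, the position estimate looks something like this (a sketch
with illustrative names and constants, not my actual code; the metrics come
from the tester described below):

    /* Estimate how far (in bytes) the head has advanced past the
     * position it was at when the previous write completed at time t0.
     * rotation_us and bytes_per_rotation are measured empirically. */
    #include <sys/time.h>

    static long rotation_us = 8333;              /* 7,200 RPM */
    static long bytes_per_rotation = 390L * 1024;

    static long
    head_advance_since(const struct timeval *t0)
    {
        struct timeval now;
        long elapsed_us;

        gettimeofday(&now, NULL);
        elapsed_us = (now.tv_sec - t0->tv_sec) * 1000000L
                   + (now.tv_usec - t0->tv_usec);
        elapsed_us %= rotation_us;     /* position within one rotation */
        return (long) ((long long) elapsed_us * bytes_per_rotation
                       / rotation_us);
    }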

Assuming the following:

1) Disk rotation time is 8.333ms or 8333us (7200 RPM).

2) A write at offset 1,500K completes at system time 103s 000ms 000us

3) A new write is requested at system time 103s 004ms 166us

4) A 390K per rotation alignment of the data on the disk.

5) A write must be sent at least 20K ahead of the current head position to
ensure that it is written in less than one rotation.

It can be determined from the above that half a rotation (4,166us) has
elapsed since the last write completed, so the head has advanced half a
rotation's worth of data, or about 195K, and is now near offset 1,695K.
Adding the 20K lead from assumption 5, a write for an offset of at least
215K past the last write, or offset 1,715K, will be safely ahead of the
current location of the head and will therefore complete in less than a
single rotation's time.
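
In code form, the same arithmetic (using the numbers from the assumptions
above; everything here is illustrative):

    /* Worked example of the target-offset computation. */
    #include <stdio.h>

    int main(void)
    {
        long rotation_us   = 8333;            /* assumption 1 */
        long bytes_per_rot = 390L * 1024;     /* assumption 4 */
        long safety_lead   = 20L * 1024;      /* assumption 5 */
        long last_offset   = 1500L * 1024;    /* assumption 2 */
        long elapsed_us    = 4166;            /* assumption 3 minus 2 */

        long head_advance  = (elapsed_us % rotation_us) * bytes_per_rot
                             / rotation_us;   /* ~195K */
        long target_offset = last_offset + head_advance + safety_lead;

        printf("target offset: %ld bytes (~%ldK)\n",
               target_offset, target_offset / 1024);    /* ~1,715K */
        return 0;
    }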

The disk-specific metrics (rotation speed, bytes per rotation, base write
time, etc.) can be derived empirically through a tester program that would
take a few minutes to run and which could be run at log setup time.
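
The core of such a tester could be as simple as repeatedly rewriting the
same block synchronously; each rewrite has to wait for the platter to come
around again, so the average interval approximates the rotation time. A
rough sketch (error handling mostly omitted; the device path is
illustrative, and raw devices may require sector-aligned buffers on some
platforms):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define PROBES 200

    int main(void)
    {
        char block[8192];
        struct timeval start, end;
        int fd = open("/dev/rda0s1e", O_WRONLY | O_SYNC);

        if (fd < 0)
            return 1;
        memset(block, 0, sizeof(block));
        gettimeofday(&start, NULL);
        for (int i = 0; i < PROBES; i++)
            pwrite(fd, block, sizeof(block), 0); /* same offset each time */
        gettimeofday(&end, NULL);

        printf("approx. rotation time: %ld us\n",
               ((end.tv_sec - start.tv_sec) * 1000000L
                + (end.tv_usec - start.tv_usec)) / PROBES);
        return 0;
    }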

The obvious problem with the above mechanism is that the WAL log needs to be
able to read from the log file in transaction order during recovery. This
could be provided for using an abstraction that prepends the logical order
for each block written to the disk and makes sure that the log blocks contain
either a valid logical order number or some other marker indicating that the
block is not being used.
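
For concreteness, the per-block header might look something like this (a
sketch; the field names are made up):

    #include <stdint.h>

    #define WAL_BLOCK_UNUSED 0      /* sequence numbers start at 1 */

    typedef struct
    {
        uint64_t logical_seqno;     /* position in transaction order, or
                                     * WAL_BLOCK_UNUSED if block is free */
        uint32_t data_length;       /* valid bytes following the header */
        uint32_t crc;               /* guards against torn or garbage
                                     * blocks; my addition, not strictly
                                     * required */
    } WalRawBlockHeader;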

A bitmap of blocks that have already been used would be kept in memory for
quickly determining the next set of possible unused blocks. This bitmap
would not need to be written to disk except during normal shutdown, since in
the event of a failure it would be reconstructed by reading all the blocks
from the disk.
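
The bitmap itself could be as simple as (illustrative; no locking shown):

    #include <stdint.h>

    #define N_BLOCKS 65536          /* blocks in the raw partition */

    static uint8_t used_map[N_BLOCKS / 8];

    static void mark_used(long blk) { used_map[blk >> 3] |= 1 << (blk & 7); }
    static int  is_used(long blk)   { return used_map[blk >> 3] & (1 << (blk & 7)); }

    /* Find the first free block at or after 'start'; -1 if none. */
    static long
    next_free_block(long start)
    {
        for (long blk = start; blk < N_BLOCKS; blk++)
            if (!is_used(blk))
                return blk;
        return -1;
    }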

Checkpointing and something akin to log rotation could be handled using this
mechanism as well.

So, MY REAL QUESTION is whether or not this is the sort of speed improvement
that warrants the work of writing the required abstraction layer and making
this very robust. The WAL code should remain essentially unchanged, with
perhaps new calls for the five or six routines used to access the log files
and to handle the equivalent of log rotation for raw device access. These new
calls would either use the current file-based implementation or the new
logging mechanism, depending on the configuration.

I anticipate that the extra work required for a PostgreSQL administrator to
use the proposed logging mechanism would be to:

1) Create a raw device partition of the appropriate size
2) Run the metrics tester for that device partition
3) Set the appropriate configuration parameters to indicate raw WAL logging
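
For illustration, step 3 might look something like this in postgresql.conf
(the parameter names are invented, not existing options):

    # hypothetical parameters, names invented for illustration
    wal_raw_logging = true
    wal_raw_device = '/dev/rda0s1e'       # raw partition from step 1
    wal_raw_rotation_us = 8333            # from the metrics tester
    wal_raw_bytes_per_rotation = 399360   # from the metrics tester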

I anticipate that the additional space requirements for this system would be
on the order of 10% to 15% beyond the current file-based implementation's
requirements.

So, is this worth doing? Would a robust implementation likely be accepted for
7.4, assuming it can demonstrate speed improvements in the range of 500 tps?

- Curtis



Re: 500 tpsQL + WAL log implementation

From: Tom Lane
"Curtis Faith" <curtis@galtair.com> writes:
> Using a raw file partition and a time-based technique for determining the
> optimal write position, I am able to get 8K writes physically written to disk
> synchronously in the range of 500 to 650 writes per second using FreeBSD raw
> device partitions on IDE disks (with write cache disabled).

What can you do *without* using a raw partition?

I dislike that idea for two reasons: portability and security.  The
portability disadvantages are obvious.  And in ordinary system setups
Postgres would have to run as root in order to write on a raw partition.

It occurs to me that the same technique could be used without any raw
device access.  Preallocate a large WAL file and apply the method within
it.  You'll have more noise in the measurements due to greater
variability in the physical positioning of the blocks --- but it's
rather illusory to imagine that you know the disk geometry with any
accuracy anyway.  Modern drives play a lot of games under the hood.
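
The preallocation step is trivial; a minimal sketch, with made-up names and
sizes:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Fill a WAL file with zeroes up front so later positioned writes
     * never have to extend it.  Sketch only; minimal error handling. */
    int
    preallocate_wal(const char *path, long nblocks)
    {
        char zeroes[8192];
        int fd = open(path, O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return -1;
        memset(zeroes, 0, sizeof(zeroes));
        for (long i = 0; i < nblocks; i++)
            if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
                return -1;
        if (fsync(fd) < 0)          /* force the allocation to disk */
            return -1;
        return close(fd);
    }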

> The obvious problem with the above mechanism is that the WAL log needs to be
> able to read from the log file in transaction order during recovery. This
> could be provided for using an abstraction that prepends the logical order
> for each block written to the disk and makes sure that the log blocks contain
> either a valid logical order number or some other marker indicating that the
> block is not being used.

This scares me quite a bit too.  The reason that the existing
implementation maxes out at one WAL write per rotation is that for small
transactions it's having to repeatedly write the same disk sector.  You
could only get around that by writing multiple versions of the same WAL
page at different disk locations.  Reliably reconstructing what data to
use is not something that I'm prepared to accept on a handwave...
        regards, tom lane


Re: 500 tpsQL + WAL log implementation

From: "Curtis Faith"
Tom Lane wrote:
> What can you do *without* using a raw partition?
>
> I dislike that idea for two reasons: portability and security.  The
> portability disadvantages are obvious.  And in ordinary system setups
> Postgres would have to run as root in order to write on a raw partition.
>
> It occurs to me that the same technique could be used without any raw
> device access.  Preallocate a large WAL file and apply the method within
> it.  You'll have more noise in the measurements due to greater
> variability in the physical positioning of the blocks --- but it's
> rather illusory to imagine that you know the disk geometry with any
> accuracy anyway.  Modern drives play a lot of games under the hood.

A write to a raw disk file is immediate and completes with minimal system
overhead. I'll try to test a file-based approach using a write followed by an
immediate fdatasync and see if that approaches the speed of the raw
partition access. I suspect we'll get decent performance, perhaps only 10% to
15% slower.
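
The test loop I have in mind is roughly this (a sketch;
next_target_offset() is a hypothetical stand-in for the head-position
estimate described earlier):

    #include <unistd.h>

    extern off_t next_target_offset(void);   /* hypothetical helper */

    /* Positioned write followed immediately by fdatasync, which waits
     * for the physical write to complete. */
    int
    wal_write_block(int fd, const char *buf, size_t len)
    {
        if (pwrite(fd, buf, len, next_target_offset()) != (ssize_t) len)
            return -1;
        return fdatasync(fd);
    }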

As you mention, there is nothing exact about the technique, so we should be
able to get similar improvements with a file-based system. I've been able to
get over 1,500 writes per second confirmed to disk using raw partition
writes, each slightly offset ahead of the previous one, yet only somewhere
between 500 and 650 on a sustained basis using the technique I described,
because of the noise in the geometry measurements and variable timing for
the writes themselves.

>
> This scares me quite a bit too.  The reason that the existing
> implementation maxes out at one WAL write per rotation is that for small
> transactions it's having to repeatedly write the same disk sector.  You
> could only get around that by writing multiple versions of the same WAL
> page at different disk locations.  Reliably reconstructing what data to
> use is not something that I'm prepared to accept on a handwave...

I'm pretty sure this could be done very reliably but at the cost of slightly
slower reading after a failure for redo.

I figured that whenever a transaction wrote to the log it would set the log
offset marker for new transactions to force the next transaction to use a new
block. This would result in space waste which could be partially offset by
using writes smaller than the 8K block size (along disk block size
boundaries, 512 bytes for my disk). This has the advantage of making it
fairly easy to make sure that the log can be reconstructed in order since
there would be no partial block writes to worry about.
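
Rounding a record up to the next sector boundary is trivial (illustrative):

    /* Round a record up to the disk sector size so no partial-sector
     * write can tear it (512 bytes on my disk). */
    #define SECTOR_SIZE 512

    static long
    sector_round_up(long nbytes)
    {
        return (nbytes + SECTOR_SIZE - 1) & ~(long) (SECTOR_SIZE - 1);
    }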

I believe that 4 to 8 full rotations' worth of usable blocks could be
maintained, and blocks would be written to the lowest-offset tracks first
unless there were no free blocks of sufficient size. This would probably
result in 90% to 95% utilization of the blocks (disregarding waste inside the
blocks themselves). When the lowest-offset track filled up sufficiently,
another empty track would be added to the usable blocks list and the filled
lowest-offset track taken off it.

This would ensure that a read of 4 to 8 tracks, which needs to be a fixed
number for any given installation, could reconstruct the order of the WAL log
since at no time would blocks be out of order beyond that range.
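
Recovery would then read every block in that fixed window, discard the
unused ones, and sort the rest by the prepended sequence number. A sketch
(reusing the hypothetical header fields from my earlier message):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct
    {
        uint64_t logical_seqno;     /* from the block header */
        long     disk_block;        /* where the block was found */
    } RecoveryEntry;

    static int
    cmp_seqno(const void *a, const void *b)
    {
        uint64_t sa = ((const RecoveryEntry *) a)->logical_seqno;
        uint64_t sb = ((const RecoveryEntry *) b)->logical_seqno;
        return (sa > sb) - (sa < sb);
    }

    /* Sort 'n' recovered entries into transaction order for redo. */
    static void
    order_for_redo(RecoveryEntry *entries, long n)
    {
        qsort(entries, n, sizeof(RecoveryEntry), cmp_seqno);
    }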

Disk space is much cheaper than CPU and memory, so I think that a logging
system that used as much as three or four times the space but is three or
four times faster would be a worthwhile improvement for those systems where
update or insert volume is very heavy. Obviously, this needs to be an
option, not the default configuration.

- Curtis



Re: 500 tpsQL + WAL log implementation

From: "scott.marlowe"
Curtis, have you considered comparing raw writes versus file system writes
on a multi-disk (RAID) device?

I always set up my machines to store data on a mirror set (RAID1) or RAID5 
set, and it seems your method should be tested there too.

P.S., Tom, the postgresql user would NOT need to run as root to do this;
you can just do a 'chown postgres.postgres /dev/sda1' or 'chown
postgres.postgres /dev/md0' to give postgresql permission to read from
and write to a raw partition.

That said, I am not sure how much we are likely to gain from this method 
either.  But if no one tries it out we'll never know.



Re: 500 tpsQL + WAL log implementation

From: Justin Clift
Bravo Curtis,

This is all excellent research.

:-)

Regards and best wishes,

Justin Clift


Curtis Faith wrote:
<snip>
> Disk space is much cheaper than CPU and memory, so I think that a logging
> system that used as much as three or four times the space but is three or
> four times faster would be a worthwhile improvement for those systems where
> update or insert volume is very heavy. Obviously, this needs to be an
> option, not the default configuration.
> 
> - Curtis

-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."  - Indira Gandhi