Thread: 500 tpsQL + WAL log implementation

500 tpsQL + WAL log implementation

From: "Curtis Faith"
I have been experimenting with empirical tests of file system and device
level writes to determine the actual constraints in order to speed up the WAL
logging code.

Using a raw file partition and a time-based technique for determining the
optimal write position, I am able to get 8K writes physically written to disk
synchronously in the range of 500 to 650 writes per second, using FreeBSD raw
device partitions on IDE disks (with write cache disabled).  I will be
testing it soon under Linux with 10,000 RPM SCSI disks, which should do even
better.
It is my belief that the mechanism used to achieve these speeds could be
incorporated into the existing WAL logging code as an abstraction that looks
to the WAL code just like the file-level access currently used. Without such
a technique, synchronous write speeds are limited by the time of a single
disk rotation: for a 7,200 RPM disk that is 120 writes per second; for a
10,000 RPM disk, 166.66 per second.
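
For reference, that rotation limit can be measured with a baseline test like
the sketch below (the device path is an illustrative assumption; error
handling is minimal). Rewriting the same block with O_SYNC forces the head
to wait a full rotation for every write:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/rwd0s1";      /* illustrative raw partition path */
    const size_t blksz = 8192;            /* 8K writes, as in the tests above */
    struct timeval t0, t1;
    int fd, i, n = 500;
    double secs;
    char *buf;

    fd = open(dev, O_WRONLY | O_SYNC);    /* O_SYNC: return only after the
                                           * data reaches the platter */
    if (fd < 0) { perror("open"); return 1; }
    buf = aligned_alloc(4096, blksz);     /* raw devices want aligned buffers */
    memset(buf, 'x', blksz);

    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++)               /* same offset every time, so each
                                           * write waits one full rotation */
        if (pwrite(fd, buf, blksz, 0) != (ssize_t) blksz)
        { perror("pwrite"); return 1; }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f synchronous 8K writes/second\n", n / secs);  /* ~120 at 7,200 RPM */
    return 0;
}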

The mechanism works by adjusting the seek offset of each write, using
gettimeofday() to determine approximately where the disk head is in its
rotation. The mechanism does not use any AIO calls.

Assuming the following:

1) Disk rotation time is 8.333ms or 8333us (7200 RPM).

2) A write at offset 1,500K completes at system time 103s 000ms 000us

3) A new write is requested at system time 103s 004ms 166us

4) The data on the disk is laid out at 390K per rotation (one rotation passes
390K of data under the head).

5) A write must be sent at least 20K ahead of the current head position to
ensure that it is written in less than one rotation.

It can be determined from the above that the head has traveled half a
rotation (4,166us of 8,333us), or 195K, since the last write. A write sent at
least 20K further ahead, i.e., at least 215K past the last write (offset
1,715K or slightly beyond), will be ahead of the current location of the head
and will therefore complete in less than a single rotation's time.
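
A minimal sketch of that calculation, plugging in the numbers above (the
names and calling convention are illustrative assumptions, not actual WAL
code):

#include <stdio.h>

#define ROTATION_US    8333LL            /* one rotation at 7,200 RPM */
#define BYTES_PER_ROT  (390LL * 1024)    /* assumption 4 */
#define LEAD_BYTES     (20LL * 1024)     /* assumption 5 */

/* From the offset and completion time (in microseconds) of the last
 * write and the current time, pick an offset just ahead of the head. */
static long long
next_write_offset(long long last_offset, long long last_done_us,
                  long long now_us)
{
    long long elapsed = (now_us - last_done_us) % ROTATION_US;
    long long head_advance = elapsed * BYTES_PER_ROT / ROTATION_US;

    return last_offset + head_advance + LEAD_BYTES;
}

int main(void)
{
    /* The example above: last write at 1,500K completed at t = 0,
     * new write requested 4,166us (half a rotation) later. */
    long long off = next_write_offset(1500LL * 1024, 0, 4166);

    printf("next write at about %lldK\n", off / 1024);  /* about 1,715K */
    return 0;
}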

The disk-specific metrics (rotation speed, bytes per rotation, base write
time, etc.) can be derived empirically through a tester program that would
take a few minutes to run and could be run at log setup time.
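
The rotation time, for instance, falls out of the same rewrite-one-block
trick as the baseline sketch earlier: in the steady state each rewrite costs
exactly one rotation. A sketch (names assumed, error handling omitted):

#include <sys/time.h>
#include <unistd.h>

/* Rewriting one block with O_SYNC makes every write wait a full
 * rotation, so the average interval approximates the rotation time. */
static long
measure_rotation_us(int fd, char *buf, size_t blksz)
{
    struct timeval t0, t1;
    int i, n = 200;

    pwrite(fd, buf, blksz, 0);        /* prime: park the head on the block */
    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++)
        pwrite(fd, buf, blksz, 0);    /* one rotation per rewrite */
    gettimeofday(&t1, NULL);

    return ((t1.tv_sec - t0.tv_sec) * 1000000L
            + (t1.tv_usec - t0.tv_usec)) / n;
}

Bytes per rotation could be derived similarly: write at increasing offset
deltas from a fixed block and keep the smallest delta whose latency drops
well below one rotation.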

The obvious problem with the above mechanism is that the WAL code needs to be
able to read the log back in transaction order during recovery. This could be
provided for by an abstraction that prepends a logical order number to each
block written to the disk and ensures that every log block contains either a
valid logical order number or some other marker indicating that the block is
not in use.
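
A sketch of what such a block header might look like (the layout, names, and
CRC field are illustrative assumptions, not an actual PostgreSQL structure):

#include <stdint.h>

#define RAWLOG_UNUSED  UINT64_MAX     /* marker: block not in use */

typedef struct RawLogBlockHeader
{
    uint64_t    logical_order;        /* increasing sequence number, or
                                       * RAWLOG_UNUSED */
    uint32_t    crc;                  /* guards against torn block writes */
    uint32_t    payload_len;          /* bytes of WAL data after the header */
} RawLogBlockHeader;

/* Recovery reads every block on the partition, discards unused ones,
 * and sorts the rest by logical_order to restore transaction order. */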

A bitmap of blocks that have already been used would be kept in memory for
quickly determining the next set of possible unused blocks. This bitmap would
not need to be written to disk except during normal shutdown, since in the
event of a failure it would be reconstructed by reading all the blocks from
the disk.
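
A sketch of that bitmap (names and layout are illustrative assumptions):

#include <stdint.h>

typedef struct RawLogBitmap
{
    uint8_t    *bits;                 /* one bit per block on the partition */
    long        nblocks;
} RawLogBitmap;

static void
bitmap_mark_used(RawLogBitmap *bm, long blk)
{
    bm->bits[blk >> 3] |= (uint8_t) (1 << (blk & 7));
}

/* Return the first unused block at or after 'from', wrapping around;
 * -1 means the partition is full and a checkpoint must free space. */
static long
bitmap_next_free(const RawLogBitmap *bm, long from)
{
    long i;

    for (i = 0; i < bm->nblocks; i++)
    {
        long blk = (from + i) % bm->nblocks;

        if (!(bm->bits[blk >> 3] & (1 << (blk & 7))))
            return blk;
    }
    return -1;
}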

Checkpointing and something akin to log rotation could be handled using this
mechanism as well.

So, MY REAL QUESTION is whether or not this is the sort of speed improvement
that warrants the work of writing the required abstraction layer and making
it very robust. The WAL code should remain essentially unchanged, with
perhaps new calls for the five or six routines used to access the log files,
plus handling for the equivalent of log rotation on raw devices. These new
calls would dispatch to either the current file-based implementation or the
new logging mechanism, depending on the configuration.
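
The abstraction could be as small as a table of function pointers chosen at
startup; a sketch (routine names are assumptions, not the actual WAL entry
points):

#include <stddef.h>
#include <sys/types.h>

typedef struct WalStorageOps
{
    int     (*open_log)(const char *path);
    ssize_t (*write_block)(int fd, const void *buf, size_t len, long blkno);
    ssize_t (*read_block)(int fd, void *buf, size_t len, long blkno);
    int     (*sync_log)(int fd);
    int     (*rotate_log)(int fd);    /* the log-rotation equivalent */
} WalStorageOps;

/* One instance filled in by the current file-based code, another by the
 * raw-device code; the configuration picks which one the WAL code uses:
 *
 *     wal_ops = use_raw_wal ? &wal_raw_ops : &wal_file_ops;
 */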

I anticipate that the extra work required for a PostgreSQL administrator to
use the proposed logging mechanism would be to:

1) Create a raw device partition of the appropriate size
2) Run the metrics tester for that device partition
3) Set the appropriate configuration parameters to indicate raw WAL logging
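
Step 3 might amount to a couple of postgresql.conf lines along these lines
(the parameter names are purely hypothetical; no such settings exist):

# Hypothetical parameters, for illustration only
wal_method = raw                      # 'file' (default) or 'raw'
wal_raw_device = '/dev/rwd0s1'        # partition created in step 1
wal_raw_metrics_file = 'wal_metrics'  # output of the tester in step 2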

I anticipate that the additional space requirements for this system would be
on the order of 10% to 15% beyond the current file-based implementation's
requirements.

So, is this worth doing? Would a robust implementation likely be accepted for
7.4, assuming it can demonstrate speed improvements in the range of 500 tps?

- Curtis

Re: 500 tpsQL + WAL log implementation

From: Bruce Momjian
I kept this around from November because I wanted to think about it
further.

Basically, right now when we continually add to the last WAL page, we
have to wait for the platter to rotate to that block each time to do the
fsync. 

The idea is to have multiple versions of the last WAL block: you write the
first record of the last block, then when you want to write another, your
disk platter has moved, so you write the first and second records together
in a new location.

If we could somehow know the platter location, or tell the disk drive to
write it in several locations, whichever is closest, I think we would
have a real win.  However, I don't see any way of doing that.

Imagine what we could do with 8mb of battery-backed RAM!  This is sort
of what we are using WAL/disk for, and it really isn't very good at it.

Is this a TODO item?
Find a way to reduce rotational delay when repeatedly writing last WAL page


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: 500 tpsQL + WAL log implementation

From: "Peter Galbavy"
Bruce Momjian wrote:
> The idea is to have multiple versions of the last WAL block: you write
> the first record of the last block, then when you want to write another,
> your disk platter has moved, so you write the first and second records
> together in a new location.

But how much of this is entirely dependent on deterministic prediction of
the disk's activity?

Consider not only that modern disks have their own write caches (most IDE
drives now come with between 2 and 8 MB), but also transparent bad-sector
remapping, and filesystem issues with ufs, ext2, and the journalling
extensions to both.

While I believe that there is value in working towards a better coupling
between PostgreSQL and the underlying hardware, is this approach going to be
productive in the "real" world? Enough to spend time on it?

Your choice, mind; I am just whining.

Peter



Re: 500 tpsQL + WAL log implementation

From: Jan Wieck
Bruce Momjian wrote:
> [...]
> If we could somehow know the platter location, or tell the disk drive to
> write it in several locations, whichever is closest, I think we would
> have a real win.  However, I don't see any way of doing that.
> 
> Imagine what we could do with 8mb of battery-backed RAM!  This is sort
> of what we are using WAL/disk for, and it really isn't very good at it.

Isn't that what modern disk drives have ... well, not exactly battery 
backed, but actually the energy they have in the rotation is enough to 
flush the "write cache" out to the surface and get the heads back into 
the parking position in the case of a power loss.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: 500 tpsQL + WAL log implementation

From: Bruce Momjian
Jan Wieck wrote:
> Bruce Momjian wrote:
> > [...]
> > If we could somehow know the platter location, or tell the disk drive to
> > write it in several locations, whichever is closest, I think we would
> > have a real win.  However, I don't see any way of doing that.
> > 
> > Imagine what we could do with 8mb of battery-backed RAM!  This is sort
> > of what we are using WAL/disk for, and it really isn't very good at it.
> 
> Isn't that what modern disk drives have ... well, not exactly battery 
> backed, but actually the energy they have in the rotation is enough to 
> flush the "write cache" out to the surface and get the heads back into 
> the parking position in the case of a power loss.

That's what I am not sure about --- if those drives return a 'complete'
before getting the actual data on the drive, then we don't have a
rotational delay problem because it isn't waiting for the platter to
spin into place, _and_ the data is secure.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: 500 tpsQL + WAL log implementation

From: Jan Wieck
Bruce Momjian wrote:
> Jan Wieck wrote:
> 
>>Bruce Momjian wrote:
>>
>>>[...]
>>>If we could somehow know the platter location, or tell the disk drive to
>>>write it in several locations, whichever is closest, I think we would
>>>have a real win.  However, I don't see any way of doing that.
>>>
>>>Imagine what we could do with 8mb of battery-backed RAM!  This is sort
>>>of what we are using WAL/disk for, and it really isn't very good at it.
>>
>>Isn't that what modern disk drives have ... well, not exactly battery 
>>backed, but actually the energy they have in the rotation is enough to 
>>flush the "write cache" out to the surface and get the heads back into 
>>the parking position in the case of a power loss.
> 
> 
> That's what I am not sure about --- if those drives return a 'complete'
> before getting the actual data on the drive, then we don't have a
> rotational delay problem because it isn't waiting for the platter to
> spin into place, _and_ the data is secure.

The option is called "Immediate SCSI Error Reporting" ;-)


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #