extending relations more efficiently - Mailing list pgsql-hackers

From Robert Haas
Subject extending relations more efficiently
Msg-id CA+TgmobZ60z7=XspRDHUx2W1jOUjVhPUQj4tSKgq55UVuCZjnw@mail.gmail.com
List pgsql-hackers
We've previously discussed the possible desirability of extending
relations in larger increments, rather than one block at a time, for
performance reasons.  I attempted to determine how much performance we
could possibly buy this way, and found that, as far as I can see, the
answer is, basically, none.  I wrote a test program which writes to a
file until it reaches 1GB, and times how long the writes take in
aggregate.  Then it performs a single fdatasync at the end and times
that as well.  On some of the machines it is slightly faster in the
aggregate to extend in larger chunks, but the difference is small
enough that, at least to me, it doesn't seem worth bothering with.
Some results are below.  Now, one thing that this test doesn't
help much with is the theory that it's better to extend a file in
larger chunks because the file will become less fragmented on disk.  I
don't really know how to factor that effect into the test - any ideas?
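
For illustration only, the measurement described above might be sketched in C roughly as follows.  This is not the attached test program: the names time_write_extend, elapsed_ms, BLCKSZ and MAX_BLOCKS are assumptions made for the sketch, as is the choice of CLOCK_MONOTONIC for timing.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192        /* 8kB block, as in the results in this message */
#define MAX_BLOCKS 64      /* largest chunk tried in the results: 64 blocks */

/* Milliseconds elapsed between two clock readings. */
static double
elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

/*
 * Hypothetical helper: create "path", zero-fill it out to total_bytes using
 * write() calls of blocks_per_write blocks each, then issue one fdatasync().
 * The aggregate write time and the fdatasync time are returned in
 * milliseconds.  Returns the number of bytes written, or -1 on error.
 */
static off_t
time_write_extend(const char *path, off_t total_bytes, int blocks_per_write,
                  double *write_ms, double *sync_ms)
{
    static char zeroes[MAX_BLOCKS * BLCKSZ];    /* static storage: all zero */
    struct timespec t0, t1, t2;
    off_t written = 0;
    size_t chunk;
    int fd;

    assert(blocks_per_write >= 1 && blocks_per_write <= MAX_BLOCKS);
    chunk = (size_t) blocks_per_write * BLCKSZ;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (written < total_bytes)
    {
        size_t want = chunk;
        ssize_t n;

        /* The last chunk may be shorter than a full chunk. */
        if ((off_t) want > total_bytes - written)
            want = (size_t) (total_bytes - written);
        n = write(fd, zeroes, want);
        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        written += n;           /* a short write just advances less */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (fdatasync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);
    close(fd);

    *write_ms = elapsed_ms(&t0, &t1);
    *sync_ms = elapsed_ms(&t1, &t2);
    return written;
}
```

Calling this once per chunk size (1, 2, 4, ... 64 blocks) with total_bytes set to 1GB would produce one write/fdatasync pair per run, in the same shape as the result lines further down.  Like the test described above, it says nothing about on-disk fragmentation.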

I also considered two other methods of extending a file.  First, there
is ftruncate().  It's really fast.  Unfortunately, it's unsuitable for
our purposes because it will cheerfully leave holes in the file, and
part of the reason for our current implementation is to make sure that
there are no holes, so that later writes to the file can't fail for
lack of disk space.  So that's no good.  Second, and more
interestingly, there is a function called posix_fallocate().  It is
present on Linux but not on Mac OS X; I haven't checked any other
platforms.  It claims that it will extend a file out to a particular
size, forcing disk blocks to be allocated so that later writes won't
fail.  Testing (more details below) shows that posix_fallocate() is
quite efficient for large chunks.  For example, extending a file to
1GB in size 64 blocks at a time (that is, 512kB at a time) took only
~60 ms and the subsequent fdatasync took almost no time at all,
whereas zero-filling the file out to 1GB using write() took 600-700 ms
and the subsequent fdatasync took another 4-5 seconds.  That seems
like a pretty sizable win, and it's not too hard to imagine that it
could be even better when the I/O subsystem is busy.  Unfortunately,
using posix_fallocate() for 8kB chunks seems to be significantly less
efficient than our current method - I'm guessing that it actually
writes the updated metadata back to disk, whereas write() does not
(this makes one wonder how safe it is to count on write() to have the
behavior we need here in the first place).  So in this case it seems we would
probably want to do it in larger chunks.  (We could possibly also use
it when creating new WAL files, to extend all the way out to 16MB in
one shot, at a considerable savings in I/O.)
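
As a rough sketch of the chunked posix_fallocate() approach discussed above (an illustration, not code from the attached test program or from PostgreSQL), extending a file in multi-block steps could look like this in C.  The function name fallocate_extend and the BLCKSZ constant are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192        /* 8kB block, as in the results in this message */

/*
 * Hypothetical helper: extend the file open on fd from its current size out
 * to target_bytes, blocks_per_call blocks at a time, using posix_fallocate()
 * so that disk space is actually allocated rather than left as a hole.
 * Returns 0 on success or an error number on failure.
 */
static int
fallocate_extend(int fd, off_t target_bytes, int blocks_per_call)
{
    off_t chunk;
    off_t offset;

    assert(blocks_per_call > 0);
    chunk = (off_t) blocks_per_call * BLCKSZ;

    /* Start extending from the current end of the file. */
    offset = lseek(fd, 0, SEEK_END);
    if (offset < 0)
        return errno;

    while (offset < target_bytes)
    {
        off_t len = target_bytes - offset;
        int rc;

        if (len > chunk)
            len = chunk;

        /* posix_fallocate() returns the error number; it does not set errno. */
        rc = posix_fallocate(fd, offset, len);
        if (rc != 0)
            return rc;
        offset += len;
    }
    return 0;
}
```

With blocks_per_call set to 64 this makes one call per 512kB of new space.  Extending a new WAL file out to 16MB in one shot would instead be a single posix_fallocate(fd, 0, 16 * 1024 * 1024) call.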

Any thoughts about where to go from here would be much appreciated.
Test results follow; all times are in milliseconds.

Some results from the IBM POWER7 box (ext4):

write 1 8K blocks at a time: write=782.408 fdatasync=4400.984
write 2 8K blocks at a time: write=560.569 fdatasync=4389.413
write 4 8K blocks at a time: write=479.647 fdatasync=4290.753
write 8 8K blocks at a time: write=627.038 fdatasync=4292.920
write 16 8K blocks at a time: write=619.882 fdatasync=4288.984
write 32 8K blocks at a time: write=613.037 fdatasync=4289.069
write 64 8K blocks at a time: write=608.669 fdatasync=4594.534
write 64 8K blocks at a time: write=608.475 fdatasync=4342.934
write 32 8K blocks at a time: write=612.506 fdatasync=4297.969
write 16 8K blocks at a time: write=621.387 fdatasync=4430.693
write 8 8K blocks at a time: write=629.576 fdatasync=4296.472
write 4 8K blocks at a time: write=674.419 fdatasync=4359.290
write 2 8K blocks at a time: write=652.029 fdatasync=4327.876
write 1 8K blocks at a time: write=800.973 fdatasync=4472.197

Some results from Nate Boley's 64-core box (xfs):

write 1 8K blocks at a time: write=1284.834 fdatasync=3538.361
write 2 8K blocks at a time: write=1176.082 fdatasync=3498.968
write 4 8K blocks at a time: write=1115.419 fdatasync=3634.673
write 8 8K blocks at a time: write=1088.404 fdatasync=3670.018
write 16 8K blocks at a time: write=1082.480 fdatasync=3778.763
write 32 8K blocks at a time: write=1075.875 fdatasync=3757.716
write 64 8K blocks at a time: write=1076.076 fdatasync=3996.997

Some results from Nate Boley's 32-core box (xfs):

write 1 8K blocks at a time: write=968.351 fdatasync=6013.304
write 2 8K blocks at a time: write=902.288 fdatasync=6810.980
write 4 8K blocks at a time: write=900.520 fdatasync=4886.449
write 8 8K blocks at a time: write=889.970 fdatasync=6096.856
write 16 8K blocks at a time: write=882.891 fdatasync=8136.211
write 32 8K blocks at a time: write=892.914 fdatasync=10898.796
write 64 8K blocks at a time: write=917.326 fdatasync=11223.696

And finally, from the IBM POWER7 machine, a few posix_fallocate results:

posix_fallocate 1 8K blocks at a time: write=3021.177 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=1124.638 fdatasync=0.029
posix_fallocate 4 8K blocks at a time: write=9290.490 fdatasync=0.028
posix_fallocate 8 8K blocks at a time: write=477.831 fdatasync=0.029
posix_fallocate 16 8K blocks at a time: write=425.341 fdatasync=0.032
posix_fallocate 32 8K blocks at a time: write=122.499 fdatasync=0.034
posix_fallocate 64 8K blocks at a time: write=60.789 fdatasync=0.023
posix_fallocate 64 8K blocks at a time: write=102.867 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=107.753 fdatasync=0.032
posix_fallocate 16 8K blocks at a time: write=900.674 fdatasync=0.031
posix_fallocate 8 8K blocks at a time: write=690.407 fdatasync=0.030
posix_fallocate 4 8K blocks at a time: write=550.454 fdatasync=0.035
posix_fallocate 2 8K blocks at a time: write=3447.778 fdatasync=0.031
posix_fallocate 1 8K blocks at a time: write=8753.767 fdatasync=0.030
posix_fallocate 64 8K blocks at a time: write=42.779 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=110.344 fdatasync=0.031
posix_fallocate 16 8K blocks at a time: write=181.700 fdatasync=0.030
posix_fallocate 8 8K blocks at a time: write=1599.181 fdatasync=0.032
posix_fallocate 4 8K blocks at a time: write=1076.495 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=17192.049 fdatasync=0.028
posix_fallocate 1 8K blocks at a time: write=6244.441 fdatasync=0.028

Test program is attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
