extending relations more efficiently - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | extending relations more efficiently |
Date | |
Msg-id | CA+TgmobZ60z7=XspRDHUx2W1jOUjVhPUQj4tSKgq55UVuCZjnw@mail.gmail.com |
List | pgsql-hackers |
We've previously discussed the possible desirability of extending relations in larger increments, rather than one block at a time, for performance reasons. I attempted to determine how much performance we could possibly buy this way, and found that, as far as I can see, the answer is, basically, none. I wrote a test program which does writes until it reaches 1GB, and times how long the writes take in aggregate. Then it performs a single fdatasync at the end and times that as well. On some of the machines it is slightly faster in the aggregate to extend in larger chunks, but the magnitude of the change is little enough that, at least to me, it seems entirely not worth bothering with. Some results are below.

Now, one thing that this test doesn't help much with is the theory that it's better to extend a file in larger chunks because the file will become less fragmented on disk. I don't really know how to factor that effect into the test - any ideas?

I also considered two other methods of extending a file. First, there is ftruncate(). It's really fast. Unfortunately, it's unsuitable for our purposes because it will cheerfully leave holes in the file, and part of the reason for our current implementation is to make sure that there are no holes, so that later writes to the file can't fail for lack of disk space. So that's no good.

Second, and more interestingly, there is a function called posix_fallocate(). It is present on Linux but not on MacOS X; I haven't checked any other platforms. It claims that it will extend a file out to a particular size, forcing disk blocks to be allocated so that later writes won't fail. Testing (more details below) shows that posix_fallocate() is quite efficient for large chunks.
For example, extending a file to 1GB in size 64 blocks at a time (that is, 512kB at a time) took only ~60 ms and the subsequent fdatasync took almost no time at all, whereas zero-filling the file out to 1GB using write() took 600-700 ms and the subsequent fdatasync took another 4-5 seconds. That seems like a pretty sizable win, and it's not too hard to imagine that it could be even better when the I/O subsystem is busy.

Unfortunately, using posix_fallocate() for 8kB chunks seems to be significantly less efficient than our current method - I'm guessing that it actually writes the updated metadata back to disk, where write() does not (this makes one wonder how safe it is to count on write() to have the behavior we need here in the first place). So in this case it seems we would probably want to do it in larger chunks. (We could possibly also use it when creating new WAL files, to extend all the way out to 16MB in one shot, at a considerable savings in I/O.)

Any thoughts about where to go from here would be much appreciated. Test results follow.
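As a point of reference for the numbers that follow, here is a rough sketch of the two extension strategies being compared: zero-filling with write() and reserving space with posix_fallocate(), each followed by a single fdatasync. This is not the attached test program. It is a Python approximation that assumes a Unix platform where os.posix_fallocate and os.fdatasync are available, and the names BLOCK_SIZE, extend_with_write, extend_with_fallocate, and time_extension are made up for illustration. Interpreter overhead means its absolute timings should not be compared with the figures in this message.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # 8 kB blocks, the unit used throughout this message


def extend_with_write(fd, total_bytes, blocks_per_call):
    """Zero-fill the file out to total_bytes, blocks_per_call blocks per write()."""
    chunk = bytes(BLOCK_SIZE * blocks_per_call)
    written = 0
    while written < total_bytes:
        n = min(len(chunk), total_bytes - written)
        # os.write may write fewer bytes than requested, so add what it reports.
        written += os.write(fd, chunk[:n])
    return written


def extend_with_fallocate(fd, total_bytes, blocks_per_call):
    """Extend the file to total_bytes with posix_fallocate() in fixed-size steps."""
    step = BLOCK_SIZE * blocks_per_call
    offset = 0
    while offset < total_bytes:
        n = min(step, total_bytes - offset)
        # Asks the filesystem to allocate blocks for [offset, offset + n).
        os.posix_fallocate(fd, offset, n)
        offset += n
    return offset


def time_extension(extend, total_bytes, blocks_per_call, directory=None):
    """Return (extend_ms, fdatasync_ms, final_size) for one run on a scratch file."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        t0 = time.perf_counter()
        extend(fd, total_bytes, blocks_per_call)
        t1 = time.perf_counter()
        os.fdatasync(fd)  # one sync at the end, timed separately
        t2 = time.perf_counter()
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
    return (t1 - t0) * 1000.0, (t2 - t1) * 1000.0, size
```

Calling time_extension(extend_with_fallocate, 1024 * 1024 * 1024, 64, directory=...) would correspond roughly to the "64 8K blocks at a time" case; the directory argument is there so the scratch file can be placed on the filesystem of interest (ext4, xfs, and so on) rather than wherever the default temporary directory happens to live.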
Some results from the IBM POWER7 box (ext4):

write 1 8K blocks at a time: write=782.408 fdatasync=4400.984
write 2 8K blocks at a time: write=560.569 fdatasync=4389.413
write 4 8K blocks at a time: write=479.647 fdatasync=4290.753
write 8 8K blocks at a time: write=627.038 fdatasync=4292.920
write 16 8K blocks at a time: write=619.882 fdatasync=4288.984
write 32 8K blocks at a time: write=613.037 fdatasync=4289.069
write 64 8K blocks at a time: write=608.669 fdatasync=4594.534
write 64 8K blocks at a time: write=608.475 fdatasync=4342.934
write 32 8K blocks at a time: write=612.506 fdatasync=4297.969
write 16 8K blocks at a time: write=621.387 fdatasync=4430.693
write 8 8K blocks at a time: write=629.576 fdatasync=4296.472
write 4 8K blocks at a time: write=674.419 fdatasync=4359.290
write 2 8K blocks at a time: write=652.029 fdatasync=4327.876
write 1 8K blocks at a time: write=800.973 fdatasync=4472.197

Some results from Nate Boley's 64-core box (xfs):

write 1 8K blocks at a time: write=1284.834 fdatasync=3538.361
write 2 8K blocks at a time: write=1176.082 fdatasync=3498.968
write 4 8K blocks at a time: write=1115.419 fdatasync=3634.673
write 8 8K blocks at a time: write=1088.404 fdatasync=3670.018
write 16 8K blocks at a time: write=1082.480 fdatasync=3778.763
write 32 8K blocks at a time: write=1075.875 fdatasync=3757.716
write 64 8K blocks at a time: write=1076.076 fdatasync=3996.997

Some results from Nate Boley's 32-core box (xfs):

write 1 8K blocks at a time: write=968.351 fdatasync=6013.304
write 2 8K blocks at a time: write=902.288 fdatasync=6810.980
write 4 8K blocks at a time: write=900.520 fdatasync=4886.449
write 8 8K blocks at a time: write=889.970 fdatasync=6096.856
write 16 8K blocks at a time: write=882.891 fdatasync=8136.211
write 32 8K blocks at a time: write=892.914 fdatasync=10898.796
write 64 8K blocks at a time: write=917.326 fdatasync=11223.696

And finally, from the IBM POWER7 machine, a few posix_fallocate results:

posix_fallocate 1 8K blocks at a time: write=3021.177 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=1124.638 fdatasync=0.029
posix_fallocate 4 8K blocks at a time: write=9290.490 fdatasync=0.028
posix_fallocate 8 8K blocks at a time: write=477.831 fdatasync=0.029
posix_fallocate 16 8K blocks at a time: write=425.341 fdatasync=0.032
posix_fallocate 32 8K blocks at a time: write=122.499 fdatasync=0.034
posix_fallocate 64 8K blocks at a time: write=60.789 fdatasync=0.023
posix_fallocate 64 8K blocks at a time: write=102.867 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=107.753 fdatasync=0.032
posix_fallocate 16 8K blocks at a time: write=900.674 fdatasync=0.031
posix_fallocate 8 8K blocks at a time: write=690.407 fdatasync=0.030
posix_fallocate 4 8K blocks at a time: write=550.454 fdatasync=0.035
posix_fallocate 2 8K blocks at a time: write=3447.778 fdatasync=0.031
posix_fallocate 1 8K blocks at a time: write=8753.767 fdatasync=0.030
posix_fallocate 64 8K blocks at a time: write=42.779 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=110.344 fdatasync=0.031
posix_fallocate 16 8K blocks at a time: write=181.700 fdatasync=0.030
posix_fallocate 8 8K blocks at a time: write=1599.181 fdatasync=0.032
posix_fallocate 4 8K blocks at a time: write=1076.495 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=17192.049 fdatasync=0.028
posix_fallocate 1 8K blocks at a time: write=6244.441 fdatasync=0.028

Test program is attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company