Thread: extending relations more efficiently
We've previously discussed the possible desirability of extending relations in larger increments, rather than one block at a time, for performance reasons. I attempted to determine how much performance we could possibly buy this way, and found that, as far as I can see, the answer is, basically, none. I wrote a test program which does writes until it reaches 1GB, and times how long the writes take in aggregate. Then it performs a single fdatasync at the end and times that as well. On some of the machines it is slightly faster in the aggregate to extend in larger chunks, but the magnitude of the change is little enough that, at least to me, it seems entirely not worth bothering with. Some results are below.

Now, one thing that this test doesn't help much with is the theory that it's better to extend a file in larger chunks because the file will become less fragmented on disk. I don't really know how to factor that effect into the test - any ideas?

I also considered two other methods of extending a file. First, there is ftruncate(). It's really fast. Unfortunately, it's unsuitable for our purposes because it will cheerfully leave holes in the file, and part of the reason for our current implementation is to make sure that there are no holes, so that later writes to the file can't fail for lack of disk space. So that's no good.

Second, and more interestingly, there is a function called posix_fallocate(). It is present on Linux but not on MacOS X; I haven't checked any other platforms. It claims that it will extend a file out to a particular size, forcing disk blocks to be allocated so that later writes won't fail. Testing (more details below) shows that posix_fallocate() is quite efficient for large chunks. For example, extending a file to 1GB in size 64 blocks at a time (that is, 256kB at a time) took only ~60 ms and the subsequent fdatasync took almost no time at all, whereas zero-filling the file out 1GB using write() took 600-700 ms and the subsequent fdatasync took another 4-5 seconds. That seems like a pretty sizable win, and it's not too hard to imagine that it could be even better when the I/O subsystem is busy. Unfortunately, using posix_fallocate() for 8kB chunks seems to be significantly less efficient than our current method - I'm guessing that it actually writes the updated metadata back to disk, where write() does not (this makes one wonder how safe it is to count on write to have the behavior we need here in the first place). So in this case it seems we would probably want to do it in larger chunks. (We could possibly also use it when creating new WAL files, to extend all the way out to 16MB in one shot, at a considerable savings in I/O.)

Any thoughts about where to go from here would be much appreciated. Test results follow.
Some results from the IBM POWER7 box (ext4):

write 1 8K blocks at a time: write=782.408 fdatasync=4400.984
write 2 8K blocks at a time: write=560.569 fdatasync=4389.413
write 4 8K blocks at a time: write=479.647 fdatasync=4290.753
write 8 8K blocks at a time: write=627.038 fdatasync=4292.920
write 16 8K blocks at a time: write=619.882 fdatasync=4288.984
write 32 8K blocks at a time: write=613.037 fdatasync=4289.069
write 64 8K blocks at a time: write=608.669 fdatasync=4594.534
write 64 8K blocks at a time: write=608.475 fdatasync=4342.934
write 32 8K blocks at a time: write=612.506 fdatasync=4297.969
write 16 8K blocks at a time: write=621.387 fdatasync=4430.693
write 8 8K blocks at a time: write=629.576 fdatasync=4296.472
write 4 8K blocks at a time: write=674.419 fdatasync=4359.290
write 2 8K blocks at a time: write=652.029 fdatasync=4327.876
write 1 8K blocks at a time: write=800.973 fdatasync=4472.197

Some results from Nate Boley's 64-core box (xfs):

write 1 8K blocks at a time: write=1284.834 fdatasync=3538.361
write 2 8K blocks at a time: write=1176.082 fdatasync=3498.968
write 4 8K blocks at a time: write=1115.419 fdatasync=3634.673
write 8 8K blocks at a time: write=1088.404 fdatasync=3670.018
write 16 8K blocks at a time: write=1082.480 fdatasync=3778.763
write 32 8K blocks at a time: write=1075.875 fdatasync=3757.716
write 64 8K blocks at a time: write=1076.076 fdatasync=3996.997

Some results from Nate Boley's 32-core box (xfs):

write 1 8K blocks at a time: write=968.351 fdatasync=6013.304
write 2 8K blocks at a time: write=902.288 fdatasync=6810.980
write 4 8K blocks at a time: write=900.520 fdatasync=4886.449
write 8 8K blocks at a time: write=889.970 fdatasync=6096.856
write 16 8K blocks at a time: write=882.891 fdatasync=8136.211
write 32 8K blocks at a time: write=892.914 fdatasync=10898.796
write 64 8K blocks at a time: write=917.326 fdatasync=11223.696

And finally, from the IBM POWER7 machine, a few posix_fallocate results:

posix_fallocate 1 8K blocks at a time: write=3021.177 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=1124.638 fdatasync=0.029
posix_fallocate 4 8K blocks at a time: write=9290.490 fdatasync=0.028
posix_fallocate 8 8K blocks at a time: write=477.831 fdatasync=0.029
posix_fallocate 16 8K blocks at a time: write=425.341 fdatasync=0.032
posix_fallocate 32 8K blocks at a time: write=122.499 fdatasync=0.034
posix_fallocate 64 8K blocks at a time: write=60.789 fdatasync=0.023
posix_fallocate 64 8K blocks at a time: write=102.867 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=107.753 fdatasync=0.032
posix_fallocate 16 8K blocks at a time: write=900.674 fdatasync=0.031
posix_fallocate 8 8K blocks at a time: write=690.407 fdatasync=0.030
posix_fallocate 4 8K blocks at a time: write=550.454 fdatasync=0.035
posix_fallocate 2 8K blocks at a time: write=3447.778 fdatasync=0.031
posix_fallocate 1 8K blocks at a time: write=8753.767 fdatasync=0.030
posix_fallocate 64 8K blocks at a time: write=42.779 fdatasync=0.029
posix_fallocate 32 8K blocks at a time: write=110.344 fdatasync=0.031
posix_fallocate 16 8K blocks at a time: write=181.700 fdatasync=0.030
posix_fallocate 8 8K blocks at a time: write=1599.181 fdatasync=0.032
posix_fallocate 4 8K blocks at a time: write=1076.495 fdatasync=0.029
posix_fallocate 2 8K blocks at a time: write=17192.049 fdatasync=0.028
posix_fallocate 1 8K blocks at a time: write=6244.441 fdatasync=0.028

Test program is attached.
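For illustration, a minimal sketch of the kind of timing loop described above - this is not the attached test program; the file name, argument handling, and error checking are simplified, and the posix_fallocate() path assumes Linux:

/*
 * Hypothetical sketch, not the attached test program.  Extends a file out
 * to 1GB either by writing zero-filled chunks or (if a second argument is
 * given) by calling posix_fallocate(), then times a single fdatasync().
 * Error handling is deliberately minimal.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ      8192
#define TARGET_SIZE (1024L * 1024 * 1024)   /* 1GB */

static double
elapsed_ms(struct timespec s, struct timespec e)
{
    return (e.tv_sec - s.tv_sec) * 1000.0 + (e.tv_nsec - s.tv_nsec) / 1000000.0;
}

int
main(int argc, char **argv)
{
    int         blocks_at_a_time = (argc > 1) ? atoi(argv[1]) : 1;
    int         use_fallocate = (argc > 2);
    size_t      chunk = (size_t) blocks_at_a_time * BLCKSZ;
    char       *buf = calloc(1, chunk);     /* zero-filled chunk */
    int         fd = open("extend-test.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
    off_t       offset = 0;
    struct timespec t0, t1, t2;

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (offset < TARGET_SIZE)
    {
        if (use_fallocate)
        {
            /* allocate blocks without writing any data; returns an errno value */
            if (posix_fallocate(fd, offset, (off_t) chunk) != 0)
            {
                fprintf(stderr, "posix_fallocate failed\n");
                return 1;
            }
        }
        else if (write(fd, buf, chunk) != (ssize_t) chunk)  /* zero-fill by writing */
        {
            perror("write");
            return 1;
        }
        offset += chunk;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fdatasync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("%s %d 8K blocks at a time: write=%.3f fdatasync=%.3f\n",
           use_fallocate ? "posix_fallocate" : "write",
           blocks_at_a_time, elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    close(fd);
    return 0;
}

Run it as, e.g., "./extend-test 64" for write() in 64-block chunks, or "./extend-test 64 fallocate" for the posix_fallocate() variant (the binary name is just a placeholder).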
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 1, 2012 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons. I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none.

Fair enough, but my understanding was that tests showed that the extension lock was a bottleneck, so doing extensions in larger chunks should reduce the time we spend waiting for a lock and thus improve performance. So while your results here show no gain, there is gain to be had elsewhere as a result.

The algorithm needs to be adaptive, so we don't waste space on smaller tables or lower loads.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tuesday, May 01, 2012 04:08:27 PM Robert Haas wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons. I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none. I wrote a test program which does writes
> until it reaches 1GB, and times how long the writes take in aggregate.
> Then it performs a single fdatasync at the end and times that as
> well. On some of the machines it is slightly faster in the aggregate
> to extend in larger chunks, but the magnitude of the change is little
> enough that, at least to me, it seems entirely not worth bothering
> with. Some results are below. Now, one thing that this test doesn't
> help much with is the theory that it's better to extend a file in
> larger chunks because the file will become less fragmented on disk. I
> don't really know how to factor that effect into the test - any ideas?

I think that test disregards the fact that we're holding an exclusive lock during the file extension, which happens rather frequently if you have multiple COPYs or similar running at the same time. Extending in bigger chunks reduces the frequency of taking that lock. I really would *love* to see improvements in that kind of workload.

> I also considered two other methods of extending a file. First, there
> is ftruncate(). It's really fast. Unfortunately, it's unsuitable for
> our purposes because it will cheerfully leave holes in the file, and
> part of the reason for our current implementation is to make sure that
> there are no holes, so that later writes to the file can't fail for
> lack of disk space. So that's no good. Second, and more
> interestingly, there is a function called posix_fallocate(). It is
> present on Linux but not on MacOS X; I haven't checked any other
> platforms. It claims that it will extend a file out to a particular
> size, forcing disk blocks to be allocated so that later writes won't
> fail. Testing (more details below) shows that posix_fallocate() is
> quite efficient for large chunks. For example, extending a file to
> 1GB in size 64 blocks at a time (that is, 256kB at a time) took only
> ~60 ms and the subsequent fdatasync took almost no time at all,
> whereas zero-filling the file out 1GB using write() took 600-700 ms
> and the subsequent fdatasync took another 4-5 seconds. That seems
> like a pretty sizable win, and it's not too hard to imagine that it
> could be even better when the I/O subsystem is busy. Unfortunately,
> using posix_fallocate() for 8kB chunks seems to be significantly less
> efficient than our current method - I'm guessing that it actually
> writes the updated metadata back to disk, where write() does not (this
> makes one wonder how safe it is to count on write to have the behavior
> we need here in the first place).

Currently the write() doesn't need to be crashsafe because it will be repeated on crash-recovery and a checkpoint will fsync the file.

I don't really see why it would need to compare in the 8kb case. What reason would there be to further extend in that small increments?

There is the question whether this should be done in the background though, so the relation extension lock is never hit in anything time-critical...

Andres
On Tue, May 1, 2012 at 10:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Fair enough, but my understanding was that tests showed that the
> extension lock was a bottleneck, so doing extensions in larger chunks
> should reduce the time we spend waiting for a lock and thus improve
> performance. So while your results here show no gain, there is gain to
> be had elsewhere as a result.

Maybe, but I'm skeptical. There are a few cases - such as the problem I fixed with SInvalReadLock - where the actual overhead of taking and releasing the lock is the problem. ProcArrayLock has some as-yet-unfixed issues in this area as well. But most of our locking bottlenecks are caused by doing too much work while holding the lock, not by acquiring and releasing it too frequently. A single lock manager partition can cope with upwards of 30k acquire/release cycles per second, which would amount to >240MB/sec of file extension. It seems very unlikely that we're hitting our head against that ceiling, although maybe you'd like to suggest a test case. Rather, I suspect that it's just plain taking too long to perform the actual extension. If we could extend by 8 blocks in only 4 times the time it takes to extend by 1 block, then obviously there would be a win available, but the test results suggest that isn't the case.

<...thinks...>

Maybe the solution here isn't extending in larger chunks, but allowing several extensions to proceed in parallel. It seems likely that a big chunk of the system call time is being used up waiting for write() to copy data from user space to kernel space, and there's no reason several backends couldn't do that in parallel. I think it would be sufficient to ensure that nobody starts using block N until the initial writes of all blocks < N have completed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
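For reference, the arithmetic behind that >240MB/sec ceiling, assuming one 8kB block extended per lock acquire/release cycle:

    30,000 cycles/sec x 8kB/cycle = 240,000 kB/sec, i.e. roughly 240MB/sec

of file extension funneled through a single lock manager partition.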
On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres@anarazel.de> wrote:
>> efficient than our current method - I'm guessing that it actually
>> writes the updated metadata back to disk, where write() does not (this
>> makes one wonder how safe it is to count on write to have the behavior
>> we need here in the first place).
> Currently the write() doesn't need to be crashsafe because it will be repeated
> on crash-recovery and a checkpoint will fsync the file.

That's not what I'm worried about. If the write() succeeds and then a subsequent close() on the filehandle reports an ENOSPC condition - meaning the write didn't really write after all - I am concerned that we might not handle that cleanly.

> I don't really see why it would need to compare in the 8kb case. What reason
> would there be to further extend in that small increments?

In previous discussions, the concern has been that holding the relation extension lock across a multi-block extension would cause latency spikes for both the process doing the extensions and any other concurrent processes that need the lock. Obviously if it were possible to extend by 64kB in the same time it takes to extend by 8kB that would be awesome, but if it takes eight times longer then things don't look so good.

> There is the question whether this should be done in the background though, so
> the relation extension lock is never hit in anything time-critical...

Yeah, although I'm fuzzy on how and whether that can be made to work, which is not to say that it can't.

It might also be interesting to provide a mechanism to pre-extend a relation to a certain number of blocks, though if we did that we'd have to make sure that autovac got the memo not to truncate those pages away again.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tuesday, May 01, 2012 05:06:11 PM Robert Haas wrote:
> On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres@anarazel.de> wrote:
> >> efficient than our current method - I'm guessing that it actually
> >> writes the updated metadata back to disk, where write() does not (this
> >> makes one wonder how safe it is to count on write to have the behavior
> >> we need here in the first place).
> >
> > Currently the write() doesn't need to be crashsafe because it will be
> > repeated on crash-recovery and a checkpoint will fsync the file.
>
> That's not what I'm worried about. If the write() succeeds and then a
> subsequent close() on the filehandle reports an ENOSPC condition that
> means the write didn't really write after all, I am concerned that we
> might not handle that cleanly.

Hm. While write() might not write its state to disk, I don't think that can imply that the *in memory* state is inconsistent. POSIX doesn't allow ENOSPC for close() as far as I can see.

> > I don't really see why it would need to compare in the 8kb case. What
> > reason would there be to further extend in that small increments?
> In previous discussions, the concern has been that holding the
> relation extension lock across a multi-block extension would cause
> latency spikes for both the process doing the extensions and any other
> concurrent processes that need the lock. Obviously if it were
> possible to extend by 64kB in the same time it takes to extend by 8kB
> that would be awesome, but if it takes eight times longer then things
> don't look so good.

Yes, sure.

> > There is the question whether this should be done in the background
> > though, so the relation extension lock is never hit in anything
> > time-critical...
> Yeah, although I'm fuzzy on how and whether that can be made to work,
> which is not to say that it can't.

The biggest problem I see is knowing when to trigger the extension of which file without scanning files all the time.

Using some limited-size shm-queue of {reltblspc, relfilenode} of to-be-extended files + a latch is the first thing I can think of. Every time a backend initializes a page with offset % EXTEND_SIZE == 0 it adds that table to the queue. The background writer extends the file by EXTEND_SIZE * 2 if necessary. If the queue overflows, all files are checked. Or the backends extend themselves again...

EXTEND_SIZE should probably scale with the table size up to 64MB or so...

> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.

Hm. I have to say I don't really see a big need to do this if the size of preallocation is adaptive to the file size. Sounds like it would add too much complication for little benefit.

Andres
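To make the shape of that idea concrete, here is a rough standalone sketch of such a request queue. The names (ExtendRequest, EXTEND_QUEUE_SIZE, and so on) are invented for illustration - this is not PostgreSQL code - and the locking, the latch, and the background writer loop that drains the queue are omitted.

/*
 * Hypothetical sketch of the shared-memory extension-request queue described
 * above; the type and constant names are invented, and locking/latch handling
 * is omitted.  A backend would call request_background_extension() after
 * initializing a page whose block number is a multiple of EXTEND_SIZE, and
 * the background writer would drain the queue and extend each file.
 */
#include <stdbool.h>
#include <stdint.h>

#define EXTEND_QUEUE_SIZE 64

typedef struct ExtendRequest
{
    uint32_t    tablespace_oid;     /* reltblspc */
    uint32_t    relfilenode;        /* relfilenode */
} ExtendRequest;

typedef struct ExtendQueue
{
    ExtendRequest   entries[EXTEND_QUEUE_SIZE];
    int             nentries;
    bool            overflowed;     /* if set, scan all files instead */
} ExtendQueue;

/*
 * Enqueue a request for the background writer; in a real implementation this
 * would run under a spinlock or LWLock and then set the background writer's
 * latch so it wakes up promptly.
 */
void
request_background_extension(ExtendQueue *queue,
                             uint32_t tablespace_oid, uint32_t relfilenode)
{
    if (queue->nentries >= EXTEND_QUEUE_SIZE)
    {
        /* Queue is full: remember that, so the background writer falls back
         * to checking every file (or the backend just extends by itself). */
        queue->overflowed = true;
        return;
    }
    queue->entries[queue->nentries].tablespace_oid = tablespace_oid;
    queue->entries[queue->nentries].relfilenode = relfilenode;
    queue->nentries++;
}

The adaptive part - EXTEND_SIZE growing with table size up to some cap such as 64MB - would live in whatever code decides how far ahead to extend, not in the queue itself.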
On Tue, May 1, 2012 at 11:42 AM, Andres Freund <andres@anarazel.de> wrote:
>> > There is the question whether this should be done in the background
>> > though, so the relation extension lock is never hit in anything
>> > time-critical...
>> Yeah, although I'm fuzzy on how and whether that can be made to work,
>> which is not to say that it can't.
> The biggest problem I see is knowing when to trigger the extension of which
> file without scanning files all the time.
>
> Using some limited size shm-queue of {reltblspc, relfilenode} of to-be-
> extended files + a latch is the first thing I can think of.

Perhaps. An in-memory cache of file sizes would also let us eliminate a pile of lseek system calls, but the trick is that the synchronization can't be anything simple like a spinlock - Linux does it that way in the kernel in versions <= 3.2 and it's catastrophic on short read-only transactions.

I think the first thing we need here is a good test case, so we're clear on what we're trying to solve. I was just hoping to make file extension *faster* and what you and Simon are talking about is making it scale better in the face of heavy parallelism; obviously it would be nice to do both things, but they are different problems. Any old bulk-loading test will benefit from a raw performance improvement, but to test a scalability improvement we would need some kind of test case involving parallel bulk loads, or some other kind of parallel activity that causes rapid table growth. That's not something I've frequently run into, but I'd be willing to put a bit of time into it if we can nail down what we're talking about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 1, 2012 at 10:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, May 1, 2012 at 3:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> We've previously discussed the possible desirability of extending
>> relations in larger increments, rather than one block at a time, for
>> performance reasons. I attempted to determine how much performance we
>> could possibly buy this way, and found that, as far as I can see, the
>> answer is, basically, none.
>
> Fair enough, but my understanding was that tests showed that the
> extension lock was a bottleneck, so doing extensions in larger chunks
> should reduce the time we spend waiting for a lock and thus improve
> performance. So while your results here show no gain, there is gain to
> be had elsewhere as a result.

Try to make sure that one of the scenarios involves there being multiple writers.

One of the relevant cases is where there are multiple writers, where they may wind up fighting over the last page in the table. If free space is in that one page, they might serialize on the request for access to that page. Extending by several pages at a time may help *one* of the writers, but if that adds some I/O work that takes place while other connections are blocked, waiting, then this might worsen things somewhat for the other writers. If this reduces the frequency of that sort of collision, it might nonetheless be a win.

--
When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
Christopher Browne <cbbrowne@gmail.com> writes:
> Try to make sure that one of the scenarios involves there being
> multiple writers.

> One of the relevant cases is where there are multiple writers, where
> they may wind up fighting over the last page in the table. If free
> space is in that one page, they might serialize on the request for
> access to that page.

The system is supposed to be designed to avoid that; we try to hand out pages with free space to different backends. One of the advantages of the current page-at-a-time setup is that when there is *no* free space according to the FSM, each such backend will create and fill its own page independently. They do serialize on the extension lock, but the intervening tuple additions are parallelized. We have to be careful to not make that case worse in a multi-page-extension scheme.

			regards, tom lane
Excerpts from Tom Lane's message of mar may 01 13:19:48 -0400 2012:
> Christopher Browne <cbbrowne@gmail.com> writes:
> > Try to make sure that one of the scenarios involves there being
> > multiple writers.
>
> > One of the relevant cases is where there are multiple writers, where
> > they may wind up fighting over the last page in the table. If free
> > space is in that one page, they might serialize on the request for
> > access to that page.
>
> The system is supposed to be designed to avoid that; we try to hand out
> pages with free space to different backends. One of the advantages of
> the current page-at-a-time setup is that when there is *no* free space
> according to the FSM, each such backend will create and fill its own
> page independently. They do serialize on the extension lock, but the
> intervening tuple additions are parallelized. We have to be careful to
> not make that case worse in a multi-page-extension scheme.

This didn't work all that well for toast tables that have lots of very large tuples inserted concurrently, at least with older releases. Each backend would acquire the extension lock many times per tuple inserted. The result was really high contention on the extension lock. I haven't tested it recently to see if things have already improved, though.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, May 1, 2012 at 3:01 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> The system is supposed to be designed to avoid that; we try to hand out
>> pages with free space to different backends. One of the advantages of
>> the current page-at-a-time setup is that when there is *no* free space
>> according to the FSM, each such backend will create and fill its own
>> page independently. They do serialize on the extension lock, but the
>> intervening tuple additions are parallelized. We have to be careful to
>> not make that case worse in a multi-page-extension scheme.
>
> This didn't work all that well for toast tables that have lots of very
> large tuples inserted concurrently, at least with older releases. Each
> backend would acquire the extension lock many times per tuple inserted.
> The result was really high contention on the extension lock.

Hmm, this sounds like a good test case - a small number of SQL statements triggering a large amount of relation extension work. Using an unlogged table would probably make it easier to see the relation-extension contention, too, since you'd get WALInsertLock mostly out of the way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote:
> I think the first thing we need here is a good test case, so we're
> clear on what we're trying to solve. I was just hoping to make file
> extension *faster* and what you and Simon are talking about is making
> it scale better in the face of heavy parallelism; obviously it would
> be nice to do both things, but they are different problems. Any old
> bulk-loading test will benefit from a raw performance improvement, but
> to test a scalability improvement we would need some kind of test case
> involving parallel bulk loads, or some other kind of parallel activity
> that causes rapid table growth. That's not something I've frequently
> run into, but I'd be willing to put a bit of time into it if we can
> nail down what we're talking about.

Try loading the TIGER 2011 data into a single table, where you load each county (or perhaps state) in a separate, parallel, process. That's what I was doing when I hit this lock full-force and bitched about it on this list.

I'd be happy to help construct that case, as well as test any changes in this area which might help address it (on a 10G SSD-backed SAN..).

Thanks,

Stephen
On Tue, May 1, 2012 at 3:48 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Try loading the TIGER 2011 data into a single table, where you load each
> county (or perhaps state) in a separate, parallel, process. That's what
> I was doing when I hit this lock full-force and bitched about it on this
> list.
>
> I'd be happy to help construct that case, as well as test any changes in
> this area which might help address it (on a 10G SSD-backed SAN..).

Hmm, can you ship me one of those 10G SSD-backed SANs? That'd be awesome.

In all seriousness, this is not a great test case unless you can provide some scripts to make it easy to run it in a reproducible fashion. Can you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, May 1, 2012 at 3:48 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > I'd be happy to help construct that case, as well as test any changes in
> > this area which might help address it (on a 10G SSD-backed SAN..).
>
> Hmm, can you ship me one of those 10G SSD-backed SANs? That'd be awesome.

Yeah... Not cheap. :)

> In all seriousness, this is not a great test case unless you can
> provide some scripts to make it easy to run it in a reproducible
> fashion. Can you?

Yeah, sure, I'll do that. The PostGIS folks have scripts, but they're kind of ugly, tbh.. I'll give you what I used.

Thanks,

Stephen
On 2012-05-01 22:06, Robert Haas wrote:
> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.

Good point. And just to check before skipping over it, do we know that autovacuum not leaving enough slack space is not a significant cause of the bottlenecks in the first place?

Jeroen
On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> On 2012-05-01 22:06, Robert Haas wrote:
>> It might also be interesting to provide a mechanism to pre-extend a
>> relation to a certain number of blocks, though if we did that we'd
>> have to make sure that autovac got the memo not to truncate those
>> pages away again.
>
> Good point. And just to check before skipping over it, do we know that
> autovacuum not leaving enough slack space is not a significant cause of the
> bottlenecks in the first place?

I'm not sure exactly what you mean by this: autovacuum doesn't need any slack space. Regular DML operations can certainly benefit from slack space, both within each page and overall within the relation. But there's no evidence that vacuum is doing too good a job cleaning up the mess, forcing the relation to be re-extended. Rather, the usual (and frequent) complaint is that vacuum is leaving way too much slack space - i.e. bloat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> > On 2012-05-01 22:06, Robert Haas wrote:
> >> It might also be interesting to provide a mechanism to pre-extend a
> >> relation to a certain number of blocks, though if we did that we'd
> >> have to make sure that autovac got the memo not to truncate those
> >> pages away again.
> >
> > Good point. And just to check before skipping over it, do we know that
> > autovacuum not leaving enough slack space is not a significant cause of the
> > bottlenecks in the first place?
>
> I'm not sure exactly what you mean by this: autovacuum doesn't need
> any slack space. Regular DML operations can certainly benefit from
> slack space, both within each page and overall within the relation.
> But there's no evidence that vacuum is doing too good a job cleaning
> up the mess, forcing the relation to be re-extended. Rather, the
> usual (and frequent) complaint is that vacuum is leaving way too much
> slack space - i.e. bloat.

Hm. I see those two things as different -- to me, bloat is unremoved dead tuples, whereas slack space would be free space that can be reused by new tuples. Slack space is useful as it avoids relation extension; bloat is not.

I wonder, though, if we should set a less-than-100 fillfactor for heap relations. Just like default_statistic_target, it seems that the default value should be a conservative tradeoff between two extremes. This doesn't help extension for bulk insertions a lot, of course, but it'd be useful for tables where HOT updates happen with some regularity.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
>> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
>> > On 2012-05-01 22:06, Robert Haas wrote:
>> >> It might also be interesting to provide a mechanism to pre-extend a
>> >> relation to a certain number of blocks, though if we did that we'd
>> >> have to make sure that autovac got the memo not to truncate those
>> >> pages away again.
>> >
>> > Good point. And just to check before skipping over it, do we know that
>> > autovacuum not leaving enough slack space is not a significant cause of the
>> > bottlenecks in the first place?
>>
>> I'm not sure exactly what you mean by this: autovacuum doesn't need
>> any slack space. Regular DML operations can certainly benefit from
>> slack space, both within each page and overall within the relation.
>> But there's no evidence that vacuum is doing too good a job cleaning
>> up the mess, forcing the relation to be re-extended. Rather, the
>> usual (and frequent) complaint is that vacuum is leaving way too much
>> slack space - i.e. bloat.
>
> Hm. I see those two things as different -- to me, bloat is unremoved
> dead tuples, whereas slack space would be free space that can be reused
> by new tuples. Slack space is useful as it avoids relation extension;
> bloat is not.

I guess I think of bloat as including both unremoved dead tuples and unwanted internal free space. If you create a giant table, delete 9 out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

> I wonder, though, if we should set a less-than-100 fillfactor for heap
> relations. Just like default_statistic_target, it seems that the
> default value should be a conservative tradeoff between two extremes.
> This doesn't help extension for bulk insertions a lot, of course, but
> it'd be useful for tables where HOT updates happen with some regularity.

Perhaps, but in theory that should be self-correcting: the data should spread itself onto the number of pages where HOT pruning is able to prevent further relation extension.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Hm. I see those two things as different -- to me, bloat is unremoved
>> dead tuples, whereas slack space would be free space that can be reused
>> by new tuples. Slack space is useful as it avoids relation extension;
>> bloat is not.

> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space. If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

The difficulty is to tell the difference between useless free space and useful free space. If there's a reasonable probability of putting new data into a given chunk of free space in the near future, it's not bloat.

			regards, tom lane
Excerpts from Robert Haas's message of mié may 02 12:37:35 -0400 2012:
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
> >> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> >> > On 2012-05-01 22:06, Robert Haas wrote:
> >> >> It might also be interesting to provide a mechanism to pre-extend a
> >> >> relation to a certain number of blocks, though if we did that we'd
> >> >> have to make sure that autovac got the memo not to truncate those
> >> >> pages away again.
> >> >
> >> > Good point. And just to check before skipping over it, do we know that
> >> > autovacuum not leaving enough slack space is not a significant cause of the
> >> > bottlenecks in the first place?
> >>
> >> I'm not sure exactly what you mean by this: autovacuum doesn't need
> >> any slack space. Regular DML operations can certainly benefit from
> >> slack space, both within each page and overall within the relation.
> >> But there's no evidence that vacuum is doing too good a job cleaning
> >> up the mess, forcing the relation to be re-extended. Rather, the
> >> usual (and frequent) complaint is that vacuum is leaving way too much
> >> slack space - i.e. bloat.
> >
> > Hm. I see those two things as different -- to me, bloat is unremoved
> > dead tuples, whereas slack space would be free space that can be reused
> > by new tuples. Slack space is useful as it avoids relation extension;
> > bloat is not.
>
> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space. If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

Agreed. Perhaps to solve this issue what we need is a way to migrate tuples from later pages into earlier ones. (This was one of the points, never resolved, that we discussed during the VACUUM FULL rework.)

> > I wonder, though, if we should set a less-than-100 fillfactor for heap
> > relations. Just like default_statistic_target, it seems that the
> > default value should be a conservative tradeoff between two extremes.
> > This doesn't help extension for bulk insertions a lot, of course, but
> > it'd be useful for tables where HOT updates happen with some regularity.
>
> Perhaps, but in theory that should be self-correcting: the data should
> spread itself onto the number of pages where HOT pruning is able to
> prevent further relation extension.

True. I gather you consider that the cases where it doesn't happen due to particular conditions are the ones that need manual tweaking.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Agreed. Perhaps to solve this issue what we need is a way to migrate
> tuples from later pages into earlier ones. (This was one of the points,
> never resolved, that we discussed during the VACUUM FULL rework.)

Yeah, I agree. And frankly, we need to find a way to make it work without taking AccessExclusiveLock on the relation. Having to run VACUUM FULL is killing actual users and scaring off potential ones.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of mié may 02 12:55:17 -0400 2012:
> On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Agreed. Perhaps to solve this issue what we need is a way to migrate
> > tuples from later pages into earlier ones. (This was one of the points,
> > never resolved, that we discussed during the VACUUM FULL rework.)
>
> Yeah, I agree. And frankly, we need to find a way to make it work
> without taking AccessExclusiveLock on the relation. Having to run
> VACUUM FULL is killing actual users and scaring off potential ones.

And ideally without bloating the indexes while at it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert,

* Stephen Frost (sfrost@snowman.net) wrote:
> > In all seriousness, this is not a great test case unless you can
> > provide some scripts to make it easy to run it in a reproducible
> > fashion. Can you?
>
> Yeah, sure, I'll do that. The PostGIS folks have scripts, but they're
> kind of ugly, tbh.. I'll give you what I used.

Alright, I made it dirt simple. Get a DB set up, get PostGIS installed in to it (or not, if you don't really want to..) and then download this:

http://tamriel.snowman.net/~sfrost/testcase.tar.gz

(Note: it's ~3.7G)

It'll untar in to a 'testcase' directory. If you have PostGIS installed already, just run testcase/create_linearwater.sql first, then run all the other .sql files in there in parallel by passing them to psql, eg:

for file in tl*.sql; do
  psql -d mydb -f $file &
done

It shouldn't take long to see stuff start blocking on that extension lock, since the table starts out empty.

If you don't want to install PostGIS, just change the create_linearwater.sql script to have a column at the end named 'the_geom' and nuke the 'AddGeometryColumn' call.

If you run into trouble getting PG to exhibit the expected behavior, you might try combining sets of files, like so:

for state in `ls tl_2011_* | cut -f3 -d_ | cut -c1,2 | sort -u`; do
  cat tl_2011_${state}*.sql | psql -d mydb &
done

Which will reduce the number of processes to just the number of states and territories (eg: 56), while increasing the amount of work each is doing.

Thanks,

Stephen
On Tue, 2012-05-01 at 10:08 -0400, Robert Haas wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons. I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none.

Another point here is that with checksums, we will want to make sure that zero pages can be treated as corrupt. That will probably involve using the WAL for extension operations, and we'll want to mitigate that cost somehow. Extending in larger chunks would probably be necessary.

There are some challenges there, but I think it's worth pursuing.

Regards,
	Jeff Davis