Thread: Extent Locks
All, Starting a new thread to avoid hijacking Heikki's original, but.. * Heikki Linnakangas (hlinnakangas@vmware.com) wrote: > Truncating a heap at the end of vacuum, to release unused space back to > the OS, currently requires taking an AccessExclusiveLock. Although > it's only held for a short duration, it can be enough to cause a > hiccup in query processing while it's held. Also, if there is a > continuous stream of queries on the table, autovacuum never succeeds > in acquiring the lock, and thus the table never gets truncated. Extent locking suffers from very similar problems and we really need to improve this situation. With today's fast i/o systems, and massive numbers of CPUs in a single system, it's absolutely trivial to have a whole slew of processes trying to add data to a single relation and that access getting nearly serialized due to everyone waiting on the extent lock. Perhaps one really simple approach would be to increase the size of the extent which is created in relation to the size of the relation. I've no clue what level of effort is involved there but I'm hoping such an approach would help. I've long thought that it'd be very neat if we could simply give each bulk-inserter process their own 1G chunk to insert directly into w/o having to talk to anyone else. The creation of the specific 1G piece could, hopefully, be made atomic easily (either thanks to the OS or with our own locking), etc, etc. I'm sure it's many bricks shy of a load, but I wanted to raise the issue, again, as I've seen it happening on yet another high-volume write-intensive system. Thanks, Stephen
On Wed, May 15, 2013 at 8:54 PM, Stephen Frost <sfrost@snowman.net> wrote: > Starting a new thread to avoid hijacking Heikki's original, but.. > > * Heikki Linnakangas (hlinnakangas@vmware.com) wrote: >> Truncating a heap at the end of vacuum, to release unused space back to >> the OS, currently requires taking an AccessExclusiveLock. Although >> it's only held for a short duration, it can be enough to cause a >> hiccup in query processing while it's held. Also, if there is a >> continuous stream of queries on the table, autovacuum never succeeds >> in acquiring the lock, and thus the table never gets truncated. > > Extent locking suffers from very similar problems and we really need > to improve this situation. With today's fast i/o systems, and massive > numbers of CPUs in a single system, it's absolutely trivial to have a > whole slew of processes trying to add data to a single relation and > that access getting nearly serialized due to everyone waiting on the > extent lock. > > Perhaps one really simple approach would be to increase the size of > the extent which is created in relation to the size of the relation. > I've no clue what level of effort is involved there but I'm hoping > such an approach would help. I've long thought that it'd be very neat > if we could simply give each bulk-inserter process their own 1G chunk > to insert directly into w/o having to talk to anyone else. The > creation of the specific 1G piece could, hopefully, be made atomic > easily (either thanks to the OS or with our own locking), etc, etc. > > I'm sure it's many bricks shy of a load, but I wanted to raise the > issue, again, as I've seen it happening on yet another high-volume > write-intensive system. I think you might be confused, or else I'm confused, because I don't believe we have any such thing as an extent lock. What we do have is a relation extension lock, but the size of the segment on disk has nothing to do with that: there's only one for the whole relation, and you hold it when adding a block to the relation. The organization of blocks into 1GB segments happens at a much lower level of the system, and is completely disconnected from the locking subsystem. So changing the segment size wouldn't help with this problem, and would actually be quite difficult to do, because everything in the system except at the very lowermost layer just knows about block numbers and has no idea what "extent" the block is in. But that having been said, it just so happens that I was recently playing around with ways of trying to fix the relation extension bottleneck. One thing I tried was: every time a particular backend extends the relation, it extends the relation by more than 1 block at a time before releasing the relation extension lock. Then, other backends can find those blocks in the free space map instead of having to grab the relation extension lock, so the number of acquire/release cycles on the relation extension lock goes down. This does help... but at least in my tests, extending by 2 blocks instead of 1 was the big winner, and after that you didn't get much further relief. Another thing I tried was pre-extending the relation to the estimated final size. That worked a lot better, and might be worth doing (e.g. ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution would be preferable if we can come up with one. After that, I ran out of time for investigation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
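For illustration, here is a rough, hand-written sketch of the "extend by several blocks while holding the relation extension lock, and advertise the extras in the free space map" experiment Robert describes above, expressed in terms of the existing bufmgr/lmgr/FSM entry points. This is not Robert's actual patch; the function name and the extra_blocks parameter are invented for the example, and WAL logging and error handling are omitted.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/freespace.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    /* Illustrative only: add extra_blocks empty pages and publish them in the FSM. */
    static void
    extend_relation_with_spare_blocks(Relation rel, int extra_blocks)
    {
        int     i;

        LockRelationForExtension(rel, ExclusiveLock);
        for (i = 0; i < extra_blocks; i++)
        {
            Buffer      buf = ReadBuffer(rel, P_NEW);   /* physically adds one block */
            Page        page;
            Size        freespace;
            BlockNumber blkno;

            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            page = BufferGetPage(buf);
            PageInit(page, BLCKSZ, 0);                  /* real code would also WAL-log this */
            MarkBufferDirty(buf);
            freespace = PageGetHeapFreeSpace(page);
            blkno = BufferGetBlockNumber(buf);
            UnlockReleaseBuffer(buf);

            /* Other backends can now find the page via the FSM without queuing
             * on the relation extension lock. */
            RecordPageWithFreeSpace(rel, blkno, freespace);
        }
        UnlockRelationForExtension(rel, ExclusiveLock);
    }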
Robert, For not understanding me, we seem to be in violent agreement. ;) * Robert Haas (robertmhaas@gmail.com) wrote: > I think you might be confused, or else I'm confused, because I don't > believe we have any such thing as an extent lock. The relation extension lock is what I was referring to. Apologies for any confusion there. > What we do have is > a relation extension lock, but the size of the segment on disk has > nothing to do with that: there's only one for the whole relation, and > you hold it when adding a block to the relation. Yes, which is farrr too small. I'm certainly aware that the segments on disk are dealt with in the storage layer- currently. My proposal was to consider how we might change that, a bit, to allow improved throughput when there are multiple writers. Consider this, for example- when we block on the relation extension lock, rather than sit and wait or continue to compete with the other threads, simply tell the storage layer to give us a dedicated file to work with. Once we're ready to commit, move that file into place as the next segment (through some command to the storage layer), using an atomic command to ensure that it either works and doesn't overwrite anything, or fails and we try again by moving the segment number up. We would need to work out, at the storage layer, how to handle cases where the file is less than 1G and realize that we should just skip over those blocks on disk as being known-to-be-empty. Those blocks would also be then put on the free space map and used for later processes which need to find somewhere to put new data, etc. > But that having been said, it just so happens that I was recently > playing around with ways of trying to fix the relation extension > bottleneck. One thing I tried was: every time a particular backend > extends the relation, it extends the relation by more than 1 block at > a time before releasing the relation extension lock. Right, exactly. One idea that I was discussing w/ Greg was to do this using some log(relation-size) approach or similar. > This does help... > but at least in my tests, extending by 2 blocks instead of 1 was the > big winner, and after that you didn't get much further relief. How many concurrent writers did you have and what kind of filesystem was backing this? Was it a temp filesystem where writes are essentially to memory, causing this relation extension lock to be much more contentious? > Another thing I tried was pre-extending the relation to the estimated > final size. That worked a lot better, and might be worth doing (e.g. > ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution > would be preferable if we can come up with one. Slightly confused here- above you said that '2' was way better than '1', but you implied that "more than 2 wasn't really much better"- yet "wayyy more than 2 is much better"? Did I follow that right? I can certainly understand such a case, just want to understand it and make sure it's what you meant. What "small-number" options did you try? > After that, I ran out of time for investigation. Too bad! Thanks much for the work in this area, it'd really help if we could improve this for our data warehouse users in particular. Thanks! Stephen
* Robert Haas (robertmhaas@gmail.com) wrote: > I think it's pretty unrealistic to suppose that this can be made to > work. The most obvious problem is that a sequential scan is coded to > assume that every block between 0 and the last block in the relation > is worth reading, You don't change that. However, when a seq scan asks the storage layer for blocks that it knows don't actually exist, it can simply skip over them or return "empty" records or something equivalent... Yes, that's hand-wavy, but I also think it's doable. > I suspect there are > slightly less obvious problems that would turn out to be highly > intractable. Entirely possible. :) > The assumption that block numbers are dense is probably > embedded in the system in a lot of subtle ways; if we start trying to > change that, I think we're dooming ourselves to an unending series of crocks > trying to undo the mess we've created. Perhaps. > Also, I think that's really a red herring anyway. Relation extension > per se is not slow - we can grow a file by adding zero bytes at a > pretty good clip, and don't really gain anything at the database level > by spreading the growth across multiple files. That's true when the file is on a single filesystem and a single set of drives. Make them be split across multiple filesystems/volumes where you get more drives involved... > The problem is the > relation extension LOCK, and I think that's where we should be > focusing our attention. I'm pretty confident we can find a way to > take the pressure off the lock without actually changing anything at all > at the storage layer. That would certainly be very neat and if possible might render my idea moot, which I would be more than happy with. > As a thought experiment, suppose for example > that we have a background process that knows, by magic, how many new > blocks will be needed in each relation. And it knows this just enough > in advance to have time to extend each such relation by the requisite > number of blocks and add those blocks to the free space map. Since > only that process ever needs a relation extension lock, there is no > longer any contention for any such lock. Problem solved! Sounds cute, but perhaps a bit too cute to be realistic (that's certainly been my opinion when suggested by others, which it has been, in the past). > Actually, I'm not convinced that a background process is the right > approach at all, and of course there's no actual magic that lets us > foresee exact extension needs. But I still feel like that thought > experiment indicates that there must be a solution here just by > rejiggering the locking, and maybe with a bit of modest pre-extension. > The mediocre results of my last couple tries must indicate that I > wasn't entirely successful in getting the backends out of each others' > way, but I tend to think that's just an indication that I don't > understand exactly what's happening in the contention scenarios yet, > rather than a fundamental difficulty with the approach. Perhaps. > > How many concurrent writers did you have and what kind of filesystem was > > backing this? Was it a temp filesystem where writes are essentially to > > memory, causing this relation extension lock to be much more > > contentious? > > 10. ext4. No. Ok. > If I took 30 seconds to pre-extend the relation before writing any > data into it, then writing the data went pretty much exactly 10 times > faster with 10 writers than with 1. That's rather fantastic.. > But small on-the-fly > pre-extensions during the write didn't work as well. 
I don't remember > exactly what formulas I tried, but I do remember that the few I tried > were not really any better than "always pre-extend by 1 extra block"; > and that alone eliminated about half the contention, but then I > couldn't do better. That seems quite odd to me- I would have thought extending by more than 2 blocks would have helped with the contention. Still, it sounds like extending requires a fair bit of writing, and that sucks in its own right because we're just going to rewrite that- is that correct? If so, I like the proposal even more... > I wonder if I need to use LWLockAcquireOrWait(). I'm not seeing how/why that might help? Thanks, Stephen
On Thu, May 16, 2013 at 9:36 PM, Stephen Frost <sfrost@snowman.net> wrote: >> What we do have is >> a relation extension lock, but the size of the segment on disk has >> nothing to do with that: there's only one for the whole relation, and >> you hold it when adding a block to the relation. > > Yes, which is farrr too small. I'm certainly aware that the segments on > disk are dealt with in the storage layer- currently. My proposal was to > consider how we might change that, a bit, to allow improved throughput > when there are multiple writers. > > Consider this, for example- when we block on the relation extension > lock, rather than sit and wait or continue to compete with the other > threads, simply tell the storage layer to give us a dedicated file to > work with. Once we're ready to commit, move that file into place as the > next segment (through some command to the storage layer), using an > atomic command to ensure that it either works and doesn't overwrite > anything, or fails and we try again by moving the segment number up. > > We would need to work out, at the storage layer, how to handle cases > where the file is less than 1G and realize that we should just skip over > those blocks on disk as being known-to-be-empty. Those blocks would > also be then put on the free space map and used for later processes > which need to find somewhere to put new data, etc. I think it's pretty unrealistic to suppose that this can be made to work. The most obvious problem is that a sequential scan is coded to assume that every block between 0 and the last block in the relation is worth reading, and changing that would (a) be a lot of work and (b) render the term "sequential" scan a misnomer, a choice that I think would have more bad consequences than good. I suspect there are slightly less obvious problems that would turn out to be highly intractable. The assumption that block numbers are dense is probably embedded in the system in a lot of subtle ways; if we start trying to change that, I think we're dooming ourselves to an unending series of crocks trying to undo the mess we've created. Also, I think that's really a red herring anyway. Relation extension per se is not slow - we can grow a file by adding zero bytes at a pretty good clip, and don't really gain anything at the database level by spreading the growth across multiple files. The problem is the relation extension LOCK, and I think that's where we should be focusing our attention. I'm pretty confident we can find a way to take the pressure off the lock without actually changing anything at all at the storage layer. As a thought experiment, suppose for example that we have a background process that knows, by magic, how many new blocks will be needed in each relation. And it knows this just enough in advance to have time to extend each such relation by the requisite number of blocks and add those blocks to the free space map. Since only that process ever needs a relation extension lock, there is no longer any contention for any such lock. Problem solved! Actually, I'm not convinced that a background process is the right approach at all, and of course there's no actual magic that lets us foresee exact extension needs. 
But I still feel like that thought experiment indicates that there must be a solution here just by rejiggering the locking, and maybe with a bit of modest pre-extension. The mediocre results of my last couple tries must indicate that I wasn't entirely successful in getting the backends out of each others' way, but I tend to think that's just an indication that I don't understand exactly what's happening in the contention scenarios yet, rather than a fundamental difficulty with the approach. >> This does help... >> but at least in my tests, extending by 2 blocks instead of 1 was the >> big winner, and after that you didn't get much further relief. > > How many concurrent writers did you have and what kind of filesystem was > backing this? Was it a temp filesystem where writes are essentially to > memory, causing this relation extension lock to be much more > contentious? 10. ext4. No. >> Another thing I tried was pre-extending the relation to the estimated >> final size. That worked a lot better, and might be worth doing (e.g. >> ALTER TABLE zorp SET MINIMUM SIZE 1GB) but a less manual solution >> would be preferable if we can come up with one. > > Slightly confused here- above you said that '2' was way better than '1', > but you implied that "more than 2 wasn't really much better"- yet "wayyy > more than 2 is much better"? Did I follow that right? I can certainly > understand such a case, just want to understand it and make sure it's > what you meant. What "small-number" options did you try? If I took 30 seconds to pre-extend the relation before writing any data into it, then writing the data went pretty much exactly 10 times faster with 10 writers than with 1. But small on-the-fly pre-extensions during the write didn't work as well. I don't remember exactly what formulas I tried, but I do remember that the few I tried were not really any better than "always pre-extend by 1 extra block"; and that alone eliminated about half the contention, but then I couldn't do better. I wonder if I need to use LWLockAcquireOrWait(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> I think it's pretty unrealistic to suppose that this can be made to >> work. The most obvious problem is that a sequential scan is coded to >> assume that every block between 0 and the last block in the relation >> is worth reading, > > You don't change that. However, when a seq scan asks the storage layer > for blocks that it knows don't actually exist, it can simply skip over > them or return "empty" records or something equivalent... Yes, that's > hand-wavy, but I also think it's doable. And slow. And it will involve locking and shared memory data structures of its own, to keep track of which blocks actually exist at the storage layer. I suspect the results would be more kinds of locks than we have at present, not less. >> Also, I think that's really a red herring anyway. Relation extension >> per se is not slow - we can grow a file by adding zero bytes at a >> pretty good clip, and don't really gain anything at the database level >> by spreading the growth across multiple files. > > That's true when the file is on a single filesystem and a single set of > drives. Make them be split across multiple filesystems/volumes where > you get more drives involved... I'd be interested to hear how fast dd if=/dev/zero of=somefile is on your machine compared to a single-threaded COPY into a relation. Dividing those two numbers gives us the level of concurrency at which the speed at which we can extend the relation becomes the bottleneck. On the system I tested, I think it was in the multiple tens until the kernel cache filled up ... and then it dropped way off. But I don't have access to a high-end storage system. >> If I took 30 seconds to pre-extend the relation before writing any >> data into it, then writing the data went pretty much exactly 10 times >> faster with 10 writers than with 1. > > That's rather fantastic.. One sadly relevant detail is that the relation was unlogged. Even so, yes, it's fantastic. >> But small on-the-fly >> pre-extensions during the write didn't work as well. I don't remember >> exactly what formulas I tried, but I do remember that the few I tried >> were not really any better than "always pre-extend by 1 extra block"; >> and that alone eliminated about half the contention, but then I >> couldn't do better. > > That seems quite odd to me- I would have thought extending by more than > 2 blocks would have helped with the contention. Still, it sounds like > extending requires a fair bit of writing, and that sucks in its own > right because we're just going to rewrite that- is that correct? If so, > I like the proposal even more... > >> I wonder if I need to use LWLockAcquireOrWait(). > > I'm not seeing how/why that might help? Thinking about it more, my guess is that backend A grabs the relation extension lock. Before it actually extends the relation, backends B, C, D, and E all notice that no free pages are available and queue for the lock. Backend A pre-extends the relation by some number of pages and then extends it by a second page for its own use. It then releases the relation extension lock. At this point, however, backends B, C, D, and E are already committed to extending the relation, even though some or all of them could now satisfy their need for free pages from the fsm. If they used LWLockAcquireOrWait(), then they'd all wake up when A released the lock. 
One of them would have the lock, and the rest could go retry the fsm and requeue on the lock if that failed. But as it is, what I bet is happening is that they each take the lock in turn and each extend the relation in turn. Then on the next block they write they all find free pages in the fsm, because they all pre-extended the relation, but when those free pages are used up, they all queue up on the lock again, practically at the same instant, because the fsm becomes empty at the same time for all of them. I should play around with this a bit more... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
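For illustration, the retry discipline Robert describes above might look roughly like the sketch below. The relation extension lock is a heavyweight lock rather than an LWLock, so the acquire-or-just-wait primitive shown here is hypothetical (it only mirrors the spirit of LWLockAcquireOrWait()); GetPageWithFreeSpace() and ReadBuffer() are the real FSM and buffer-manager entry points, while the outer function name is invented for the example.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/freespace.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    /* Sketch only: find a page to insert into, rechecking the FSM after every
     * wait on the extension lock instead of extending unconditionally. */
    static Buffer
    get_page_for_insert(Relation rel, Size space_needed)
    {
        for (;;)
        {
            BlockNumber blkno = GetPageWithFreeSpace(rel, space_needed);

            if (blkno != InvalidBlockNumber)
                return ReadBuffer(rel, blkno);  /* someone else already extended for us */

            /* Hypothetical primitive: either acquire the extension lock, or sleep
             * until the current holder releases it and return false. */
            if (!RelationExtensionLockAcquireOrWait(rel))
                continue;                       /* the holder may have refilled the FSM */

            {
                /* We hold the lock: extend by one block for our own use. */
                Buffer  buf = ReadBuffer(rel, P_NEW);

                UnlockRelationForExtension(rel, ExclusiveLock);
                return buf;
            }
        }
    }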
* Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfrost@snowman.net> wrote: > > You don't change that. However, when a seq scan asks the storage layer > > for blocks that it knows don't actually exist, it can simply skip over > > them or return "empty" records or something equivalent... Yes, that's > > hand-wavy, but I also think it's doable. > > And slow. And it will involve locking and shared memory data > structures of its own, to keep track of which blocks actually exist at > the storage layer. I suspect the results would be more kinds of locks > than we have at present, not less. I'm not sure that I see that- we already have to figure out where the "end" of a relation is... We would just do something similar for each component of the relation, and if it gets extended after you've passed it, too bad, because it'd be just like someone adding blocks into pages you've already seqscan'd after you're done with them. Perhaps we'd have to actually do some kind of locking on those non-existent pages, somewhere, but I'm not entirely sure where? Figuring out where and what kind of locks we take out for seqscans today on a page-level (or lower) would be the way to determine that; tbh, I'm not entirely sure what we do there because I tended to figure we didn't do much of anything.. > > That's true when the file is on a single filesystem and a single set of > > drives. Make them be split across multiple filesystems/volumes where > > you get more drives involved... > > I'd be interested to hear how fast dd if=/dev/zero of=somefile is on > your machine compared to a single-threaded COPY into a relation. I can certainly do a bit of that profiling, but it's not an uncommon issue.. Stefan ran into the exact same problem not 2 months ago, where it became very clear that it was faster to run a single thread loading data than to parallelize it at all, due to the extension locking, for a single relation. If you move to multiple relations, where you won't have the contention, things look much better, of course. > Dividing those two numbers gives us the level of concurrency at which > the speed at which we can extend the relation becomes the bottleneck. Well, we'd want to consider a parallel case, right? That's where the bottleneck of this lock really shows itself- otherwise you're just measuring the overhead from COPY parsing data. Even with COPY, I've managed to saturate a 2Gbps FC link between the server and the drives over on the SAN with PG- when running 10 threads to 10 different relations on 10 different tablespaces which go to 10 different drive pairs on the SAN (I can push > 50MB/s to each drive pair, when writing to only one drive pair, but it drops below that due to the 2Gbps fabric when I'm writing to all of them at once). Try getting anywhere close to that with any kind of system you want when parallelizing writes into a single relation and you'll see where this lock really kills performance. > On the system I tested, I think it was in the multiple tens until the > kernel cache filled up ... and then it dropped way off. But I don't > have access to a high-end storage system. The above tests were done using simple, large and relatively "slow" (spinning metal) drives- but they can sustain 50-100MB/s of seq writes without too much trouble. 
When you add all those up, either through tablespaces and partitions or with a RAID10 setup, you can get quite a bit of overall throughput- enough to easily make COPY be your bottleneck due to the CPU utilization, making you want to parallelize it, but then you hit this lock and performance goes into the toilet. > One sadly relevant detail is that the relation was unlogged. Even so, > yes, it's fantastic. That's not *ideal*, but it's also not the end of the world, since we can create an unlogged table and have it visible to multiple clients, allowing for parallel writes. We don't have any unlogged tables, but I can't recall if the above performance runs included a truncate (in the same transaction) before COPY for each of the individual partitions or not.. I don't *think* it did, but not 100% sure. We do have wal level set to minimum on these particular systems. > >> I wonder if I need to use LWLockAcquireOrWait(). > > > > I'm not seeing how/why that might help? > > Thinking about it more, my guess is that backend A grabs the relation > extension lock. Before it actually extends the relation, backends B, > C, D, and E all notice that no free pages are available and queue for > the lock. Backend A pre-extends the relation by some number of pages > page and then extends it by a second page for its own use. It then > releases the relation extension lock. At this point, however, > backends B, C, D, and E are already committed to extending the > relation, even though some or all of them could now satisfy their need > for free pages from the fsm. If they used LWLockAcquireOrWait(), then > they'd all wake up when A released the lock. One of them would have > the lock, and the rest could go retry the fsm and requeue on the lock > if that failed. Hmmm, yes, that could help then. I'm surprised to hear that it was ever set up that way, to be honest.. > But as it is, what I bet is happening is that they each take the lock > in turn and each extend the relation in turn. Then on the next block > they write they all find free pages in the fsm, because they all > pre-extended the relation, but when those free pages are used up, they > all queue up on the lock again, practically at the same instant, > because the fsm becomes empty at the same time for all of them. Ouch, yea, that could get quite painful. > I should play around with this a bit more... That sounds like a fantastic idea... ;) Thanks! Stephen
Robert, > But I still feel like that thought > experiment indicates that there must be a solution here just by > rejiggering the locking, and maybe with a bit of modest pre-extension. > The mediocre results of my last couple tries must indicate that I > wasn't entirely successful in getting the backends out of each others' > way, but I tend to think that's just an indication that I don't > understand exactly what's happening in the contention scenarios yet, > rather than a fundamental difficulty with the approach. Well, our practice of extending relations 8K-at-a-time is suboptimal on quite a number of storage platforms. It leads to increased file fragmentation, and increases write sizes on SSDs which have a default 128K block size. Also, on a large bulk load we spend way too much time extending the relation. My suggestion would be to have a storage parameter which defined the new extent size for growing the table, and allocate that much free space in the form of empty pages whenever we need new pages. The default would be 1MB, but users could adjust it to anywhere between 8K and 1GB. We'd still need an extent lock to add the 1MB (or whatever), but there's a 128X difference between allocating 8K and 1MB. The drawback to this is whatever size we choose is liable to be wrong for some users. Users who currently have a lot of 16K tables would see their databases grow alarmingly. But a default of 8K or 16K or 32K wouldn't improve the current behavior except for the very advanced users who know how to tinker with storage parameters. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
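As a rough illustration of the arithmetic behind Josh's proposal, the helper below shows how a hypothetical extent-size storage parameter could translate into pages per extension; the parameter name, the clamping bounds, and the helper itself are all assumptions, not an existing option.

    #include "postgres.h"

    /* How many 8K pages one extension would add under a hypothetical
     * "extend_size" storage parameter, clamped to the suggested 8K..1GB range. */
    static int
    blocks_per_extension(int64 extend_size_bytes)
    {
        int64   clamped = extend_size_bytes;

        if (clamped < BLCKSZ)
            clamped = BLCKSZ;                       /* never less than one page */
        if (clamped > (int64) 1024 * 1024 * 1024)
            clamped = (int64) 1024 * 1024 * 1024;   /* never more than 1GB */

        return (int) (clamped / BLCKSZ);            /* the 1MB default gives 128 pages */
    }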
On 05/17/2013 11:38 AM, Robert Haas wrote: > maybe with a bit of modest pre-extension. When it comes to pre-extension, is it realistic to get a count of backends waiting on the lock and extend the relation by (say) 2x the number of waiting backends? Getting a list of lock waiters is always a racy proposition, but in this case we don't need an accurate count, only an estimate, and the count can only grow between getting the count and completing the relation extension. Assuming it's even remotely feasible to get a count of lock waiters at all. If there are lots of procs waiting to extend the relation, a fair chunk could be preallocated with posix_fallocate on supported platforms. If it's possible this would avoid the need to attempt any recency-of-last-extension based preallocation with the associated problem of how to store and access the last-extended time efficiently, while still hopefully reducing contention on the relation extension lock and without delaying the backend doing the extension too much more. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
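posix_fallocate() is a real POSIX call, so the core of that suggestion could look something like the sketch below; the helper name and how the file descriptor and current size are obtained are assumptions, and a fallback path would still be needed on platforms without posix_fallocate().

    #include <sys/types.h>
    #include <fcntl.h>

    /* Illustrative helper: reserve space for nblocks additional 8K blocks at the
     * end of an already-open segment file.  Returns 0 on success, else an errno
     * value; posix_fallocate allocates the space without pushing zero-filled
     * pages through the buffer cache the way write() would. */
    static int
    preallocate_blocks(int fd, off_t current_size_bytes, int nblocks)
    {
        return posix_fallocate(fd, current_size_bytes, (off_t) nblocks * 8192);
    }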
On 05/18/2013 03:15 AM, Josh Berkus wrote: > The drawback to this is whatever size we choose is liable to be wrong > for some users. Users who currently have a lot of 16K tables would see > their databases grow alarmingly. This only becomes a problem for tables that're tiny, right? If your table is already 20MB you don't care if it grows to 20.1MB or 21MB next time it's extended. What about applying the relation extent size only *after* an extent's worth of blocks have been allocated in small blocks, per current behaviour? So their 32k tables stay 32k, but once they step over the 1MB barrier (or whatever) in table size the allocation mode switches to bulk-allocating large extents? Or just setting a size threshold after which extent-sized preallocation is used? -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Craig Ringer (craig@2ndquadrant.com) wrote: > On 05/17/2013 11:38 AM, Robert Haas wrote: > > maybe with a bit of modest pre-extension. > When it comes to pre-extension, is it realistic to get a count of > backends waiting on the lock and extend the relation by (say) 2x the > number of waiting backends? Having the process which has the lock do more work before releasing it, and having the other processes realize that there is room available after blocking on the lock (and not trying to extend the relation themselves..), might help. One concern that came up in Ottawa is over autovacuum coming along and discovering empty pages at the end of the relation and deciding to try and truncate it. I'm not convinced that would happen due to the locks involved but if we actually extend the relation by enough that the individual processes can continue writing for a while before another extension is needed, then perhaps it could. On the other hand, I do feel like people are worried about over-extending a relation and wasting disk space- but with the way that vacuum can clean up pages at the end, that would only be a temporary situation anyway. > If it's possible this would avoid the need to attempt any > recency-of-last-extension based preallocation with the associated > problem of how to store and access the last-extended time efficiently, > while still hopefully reducing contention on the relation extension lock > and without delaying the backend doing the extension too much more. I do like the idea of getting an idea of how many blocks are being asked for, based on how many other backends are trying to write, but I've been thinking a simple algorithm might also work well, e.g.:

    alloc_size = 1 page
    extend_time = 0
    while (writing)
        if (blocked and extend_time < 5s)
            alloc_size *= 2
        extend_start_time = now()
        extend(alloc_size)
        extend_time = now() - extend_start_time

Thanks, Stephen
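A slightly more concrete rendering of the doubling heuristic Stephen sketches above, purely for illustration; the struct, the names, and the 5-second cutoff simply restate the sketch, and nothing here is an existing API.

    #include <stdbool.h>

    typedef struct
    {
        int     alloc_blocks;       /* blocks to add on the next extension */
        double  last_extend_secs;   /* wall-clock cost of the previous extension */
    } ExtendState;

    /* Double the extension size while we keep blocking on the extension lock
     * and the previous extension was still cheap. */
    static int
    next_alloc_blocks(ExtendState *st, bool blocked_on_lock)
    {
        if (blocked_on_lock && st->last_extend_secs < 5.0)
            st->alloc_blocks *= 2;
        return st->alloc_blocks;
    }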
On Tue, May 28, 2013 at 7:36 AM, Stephen Frost <sfrost@snowman.net> wrote: > > On the other hand, I do feel like people are worried about > over-extending a relation and wasting disk space- but with the way that > vacuum can clean up pages at the end, that would only be a temporary > situation anyway. > Hi, Maybe I'm wrong but this should be easily solved by an autovacuum_no_truncate_empty_pages or an autovacuum_empty_pages_limit GUC/reloption. Just to clarify the second one: autovacuum will allow up to that limit of empty pages, and will remove any excess beyond it. We can also think in GUC/reloption for next_extend_blocks so formula is needed, or of course the automated calculation that has been proposed -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
On Tue, May 28, 2013 at 8:38 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > > We can also think in GUC/reloption for next_extend_blocks so formula > is needed, or of course the automated calculation that has been > proposed > s/so formula is needed/so *no* formula is needed btw, we can also use a next_extend_blocks GUC/reloption as a limit for autovacuum so it will allow that many empty pages at the end of the table -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
* Jaime Casanova (jaime@2ndquadrant.com) wrote: > btw, we can also use a next_extend_blocks GUC/reloption as a limit for > autovacuum so it will allow that many empty pages at the end of the table I'm really not, at all, excited about adding in GUCs for this. We just need to realize when the only available space in the relation is at the end and people are writing to it and avoid truncating pages off the end- if we don't have locks that prevent vacuum from doing this already. I'd want to see where it's actually happening before stressing over it terribly much. Thanks, Stephen
On Tue, May 28, 2013 at 9:07 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Jaime Casanova (jaime@2ndquadrant.com) wrote: >> btw, we can also use a next_extend_blocks GUC/reloption as a limit for >> autovacuum so it will allow that empty pages at the end of the table > > I'm really not, at all, excited about adding in GUCs for this. We just > need to realize when the only available space in the relation is at the > end and people are writing to it and avoid truncating pages off the end- > if we don't already have locks that prevent vacuum from doing this > already. I'd want to see where it's actually happening before stressing > over it terribly much. +1 autovacuum configuration is already much too complex as it is...we should be removing/consolidating options, not adding them. merlin
On 2013-05-28 10:07:06 -0400, Stephen Frost wrote: > * Jaime Casanova (jaime@2ndquadrant.com) wrote: > > btw, we can also use a next_extend_blocks GUC/reloption as a limit for > > autovacuum so it will allow that empty pages at the end of the table > > I'm really not, at all, excited about adding in GUCs for this. But I thought you were in favor of doing complex stuff like mapping segments filled somewhere else into place :P But I agree. This needs to work without much manual intervention. I think we just need to make autovacuum truncate only if it finds more free space than whatever amount we might have added at that relation size plus some slop. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Andres Freund (andres@2ndquadrant.com) wrote: > On 2013-05-28 10:07:06 -0400, Stephen Frost wrote: > > I'm really not, at all, excited about adding in GUCs for this. > > But I thought you were in favor of doing complex stuff like mapping > segments filled somewhere else into place :P That wouldn't require a GUC.. ;) > But I agree. This needs to work without much manual intervention. I > think we just need to make autovacuum truncate only if it finds more > free space than whatever amount we might have added at that relation > size plus some slop. Agreed. Thanks, Stephen
On Tue, May 28, 2013 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > But I agree. This needs to work without much manual intervention. I > think we just need to make autovacuum truncate only if it finds more > free space than whatever amount we might have added at that relation > size plus some slop. > And how do you decide the amount of that "slop"? -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
* Jaime Casanova (jaime@2ndquadrant.com) wrote: > On Tue, May 28, 2013 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > But I agree. This needs to work without much manual intervention. I > > think we just need to make autovacuum truncate only if it finds more > > free space than whatever amount we might have added at that relation > > size plus some slop. > > And how do you decide the amount of that "slop"? How about % of table size? Thanks, Stephen
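To make that trade-off concrete, a truncation guard along the lines being discussed might look like the sketch below; the 1% slop and the 16-page floor are placeholder numbers rather than a proposal, and the helper itself is invented for the example.

    #include "postgres.h"
    #include "storage/block.h"

    /* Sketch: skip truncation unless the trailing run of empty pages exceeds
     * whatever pre-extension might recently have added, plus a slop that is a
     * percentage of the table size. */
    static bool
    worth_truncating(BlockNumber rel_pages, BlockNumber trailing_empty_pages,
                     BlockNumber recent_preextension_pages)
    {
        BlockNumber slop = rel_pages / 100;     /* "% of table size": 1% here */

        if (slop < 16)
            slop = 16;                          /* arbitrary floor for tiny tables */

        return trailing_empty_pages > recent_preextension_pages + slop;
    }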