Thread: Vacuum: allow usage of more than 1GB of work mem
The attached patch allows setting maintenance_work_mem or autovacuum_work_mem higher than 1GB (and have it be effective), by turning the allocation of the dead_tuples into a huge allocation. This results in fewer index scans for heavily bloated tables, and could be a lifesaver in many situations (in particular, the situation I'm living through right now in production, where we don't have enough room for a vacuum full, and have just deleted 75% of a table to make room but have to rely on regular lazy vacuum to free the space). The patch also makes vacuum free the dead_tuples before starting truncation. It didn't seem necessary to hold onto it beyond that point, and it might help give the OS more cache, especially if work mem is configured very high to avoid multiple index scans. Tested with pgbench scale 4000 after deleting the whole pgbench_accounts table; it seemed to work fine.
Attachment
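For context, the core of the change described above is a relaxation of the allocation cap plus a switch to the huge-allocation API. A rough sketch of what that looks like in lazy_space_alloc() follows; the clamping lines are quoted verbatim later in the thread, while the MemoryContextAllocHuge() call is an assumption about how the allocation itself would be done, not a copy of the patch:

    maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
    maxtuples = Min(maxtuples, INT_MAX);
    /* before: Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)) */
    maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));

    /* further clamps for small relations omitted */

    /* plain palloc() refuses anything over MaxAllocSize (1GB - 1), so the
     * dead_tuples array itself has to become a huge allocation as well */
    vacrelstats->dead_tuples = (ItemPointer)
        MemoryContextAllocHuge(CurrentMemoryContext,
                               maxtuples * sizeof(ItemPointerData));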
On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: > The attached patch allows setting maintainance_work_mem or > autovacuum_work_mem higher than 1GB (and be effective), by turning the > allocation of the dead_tuples into a huge allocation. > > This results in fewer index scans for heavily bloated tables, and > could be a lifesaver in many situations (in particular, the situation > I'm living right now in production, where we don't have enough room > for a vacuum full, and have just deleted 75% of a table to make room > but have to rely on regular lazy vacuum to free the space). This part looks fine. I'm inclined to commit the attached patch soon. > The patch also makes vacuum free the dead_tuples before starting > truncation. It didn't seem necessary to hold onto it beyond that > point, and it might help give the OS more cache, especially if work > mem is configured very high to avoid multiple index scans. How long does that part ever take? Is there any substantial gain from this? Lets discuss that as a potential second patch. > Tested with pgbench scale 4000 after deleting the whole > pgbench_accounts table, seemed to work fine. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: >> The patch also makes vacuum free the dead_tuples before starting >> truncation. It didn't seem necessary to hold onto it beyond that >> point, and it might help give the OS more cache, especially if work >> mem is configured very high to avoid multiple index scans. > > How long does that part ever take? Is there any substantial gain from this? > > Lets discuss that as a potential second patch. In the test case I mentioned, it takes longer than the vacuum part itself. Other than freeing RAM there's no gain. I didn't measure any speed difference while testing, but that's probably because the backward scan doesn't benefit from the cache, but other activity on the system might. So, depending on the workload on the server, extra available RAM may be a significant gain on its own or not. It just didn't seem there was a reason to keep that RAM reserved, especially after making it a huge allocation. I'm fine either way. I can remove that from the patch or leave it as-is. It just seemed like a good idea at the time.
On Mon, Sep 5, 2016 at 11:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: >>> The patch also makes vacuum free the dead_tuples before starting >>> truncation. It didn't seem necessary to hold onto it beyond that >>> point, and it might help give the OS more cache, especially if work >>> mem is configured very high to avoid multiple index scans. >> >> How long does that part ever take? Is there any substantial gain from this? >> >> Lets discuss that as a potential second patch. > > In the test case I mentioned, it takes longer than the vacuum part itself. > > Other than freeing RAM there's no gain. I didn't measure any speed > difference while testing, but that's probably because the backward > scan doesn't benefit from the cache, but other activity on the system > might. So, depending on the workload on the server, extra available > RAM may be a significant gain on its own or not. It just didn't seem > there was a reason to keep that RAM reserved, especially after making > it a huge allocation. > > I'm fine either way. I can remove that from the patch or leave it > as-is. It just seemed like a good idea at the time. Rebased and split versions attached
Attachment
On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote: > On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: >>> The patch also makes vacuum free the dead_tuples before starting >>> truncation. It didn't seem necessary to hold onto it beyond that >>> point, and it might help give the OS more cache, especially if work >>> mem is configured very high to avoid multiple index scans. >> >> How long does that part ever take? Is there any substantial gain from this? >> >> Lets discuss that as a potential second patch. > > In the test case I mentioned, it takes longer than the vacuum part itself. Please provide a test case and timings so we can see what's happening. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> The patch also makes vacuum free the dead_tuples before starting >>>> truncation. It didn't seem necessary to hold onto it beyond that >>>> point, and it might help give the OS more cache, especially if work >>>> mem is configured very high to avoid multiple index scans. >>> >>> How long does that part ever take? Is there any substantial gain from this? >>> >>> Lets discuss that as a potential second patch. >> >> In the test case I mentioned, it takes longer than the vacuum part itself. > > Please provide a test case and timings so we can see what's happening.

The referenced test case is the one I mentioned in the OP:

- createdb pgbench
- pgbench -i -s 4000 pgbench
- psql pgbench -c 'delete from pgbench_accounts;'
- vacuumdb -v -t pgbench_accounts pgbench

with fsync=off, autovacuum=off, maintenance_work_mem=4GB.

From what I remember, it used ~2.7GB of RAM up until the truncate phase, where it freed it. It performed a single index scan over the PK. I don't remember timings, and I didn't take them, so I'll have to repeat the test to get them. It takes all day and makes my laptop unusably slow, so I'll post them later, but they're not very interesting. The only interesting bit is that it does a single index scan instead of several, which on TB-or-more tables is kinda nice.

Btw, without a further patch to prefetch pages on the backward scan for truncate, my patience ran out before it finished truncating. I haven't submitted that patch because there was an identical patch in an older thread that was discussed and more or less rejected since it slightly penalized SSDs. While I'm confident my version of the patch is a little easier on SSDs, I haven't got an SSD at hand to confirm, so that patch is still waiting in the queue until I get access to an SSD. The tests I'll run include that patch, so the timing regarding truncate won't be representative of HEAD (I really can't afford to run the tests on a scale=4000 pgbench without that patch, it crawls, and smaller scales don't fill the dead_tuples array).
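For readers unfamiliar with the truncation step: the prefetch patch referred to here amounts to issuing read-ahead hints a few blocks below the current position of the backward scan that decides how far the relation can be truncated (count_nondeletable_pages()). A rough sketch of the idea, with a made-up helper name and window size, not the actual patch:

#define TRUNCATE_PREFETCH_WINDOW 32    /* illustrative value only */

/* Hint the next blocks the backward scan will visit, so rotating media
 * isn't limited to one synchronous backward read at a time. */
static void
prefetch_ahead_of_backward_scan(Relation onerel, BlockNumber blkno,
                                BlockNumber lowest_blkno)
{
    BlockNumber n = Min(TRUNCATE_PREFETCH_WINDOW, blkno - lowest_blkno);
    BlockNumber i;

    for (i = 1; i <= n; i++)
        PrefetchBuffer(onerel, MAIN_FORKNUM, blkno - i);
}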
On 5 September 2016 at 21:58, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> How long does that part ever take? Is there any substantial gain from this? > Btw, without a further patch to prefetch pages on the backward scan > for truncate, however, my patience ran out before it finished > truncating. I haven't submitted that patch because there was an > identical patch in an older thread that was discussed and more or less > rejected since it slightly penalized SSDs. OK, that's enough context. Sorry for being forgetful on that point. Please post that new patch also. This whole idea of backwards scanning to confirm truncation seems wrong. What we want is an O(1) solution. Thinking. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 9/4/16 1:46 AM, Simon Riggs wrote: >> > The patch also makes vacuum free the dead_tuples before starting >> > truncation. It didn't seem necessary to hold onto it beyond that >> > point, and it might help give the OS more cache, especially if work >> > mem is configured very high to avoid multiple index scans. > How long does that part ever take? Is there any substantial gain from this? If you're asking about how long the dealloc takes, we're going to have to pay that cost anyway when the context gets destroyed/reset, no? Doing that sooner rather than later certainly seems like a good idea since we've seen that truncation can take quite some time. Might as well give the memory back to the OS ASAP. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
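Concretely, the part being discussed here is tiny; a sketch (not the patch text) of what it amounts to, placed just before the truncation step in lazy_vacuum_rel():

/* The dead-TID array is no longer needed once index and heap cleanup are
 * done; return the memory before the potentially long truncation phase. */
pfree(vacrelstats->dead_tuples);
vacrelstats->dead_tuples = NULL;
vacrelstats->num_dead_tuples = 0;
vacrelstats->max_dead_tuples = 0;

/* truncation then proceeds via lazy_truncate_heap() as before */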
On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > The attached patch allows setting maintainance_work_mem or > autovacuum_work_mem higher than 1GB (and be effective), by turning the > allocation of the dead_tuples into a huge allocation. > > This results in fewer index scans for heavily bloated tables, and > could be a lifesaver in many situations (in particular, the situation > I'm living right now in production, where we don't have enough room > for a vacuum full, and have just deleted 75% of a table to make room > but have to rely on regular lazy vacuum to free the space). > > The patch also makes vacuum free the dead_tuples before starting > truncation. It didn't seem necessary to hold onto it beyond that > point, and it might help give the OS more cache, especially if work > mem is configured very high to avoid multiple index scans. > > Tested with pgbench scale 4000 after deleting the whole > pgbench_accounts table, seemed to work fine. The problem with this is that we allocate the entire amount of maintenance_work_mem even when the number of actual dead tuples turns out to be very small. That's not so bad if the amount of memory we're potentially wasting is limited to ~1 GB, but it seems pretty dangerous to remove the 1 GB limit, because somebody might have maintenance_work_mem set to tens or hundreds of gigabytes to speed index creation, and allocating that much space for a VACUUM that encounters 1 dead tuple does not seem like a good plan. What I think we need to do is make some provision to initially allocate only a small amount of memory and then grow the allocation later if needed. For example, instead of having vacrelstats->dead_tuples be declared as ItemPointer, declare it as ItemPointer * and allocate the array progressively in segments. I'd actually argue that the segment size should be substantially smaller than 1 GB, like say 64MB; there are still some people running systems which are small enough that allocating 1 GB when we may need only 6 bytes can drive the system into OOM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Sep 4, 2016 at 8:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 9/4/16 1:46 AM, Simon Riggs wrote: >>> >>> > The patch also makes vacuum free the dead_tuples before starting >>> > truncation. It didn't seem necessary to hold onto it beyond that >>> > point, and it might help give the OS more cache, especially if work >>> > mem is configured very high to avoid multiple index scans. >> >> How long does that part ever take? Is there any substantial gain from >> this? > > > If you're asking about how long the dealloc takes, we're going to have to > pay that cost anyway when the context gets destroyed/reset, no? Doing that > sooner rather than later certainly seems like a good idea since we've seen > that truncation can take quite some time. Might as well give the memory back > to the OS ASAP. AFAIK, except on debug builds where it has to memset the whole thing, the cost is constant (unrelated to the allocated block size), so it should be rather small in this context. On Tue, Sep 6, 2016 at 1:42 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> The attached patch allows setting maintainance_work_mem or >> autovacuum_work_mem higher than 1GB (and be effective), by turning the >> allocation of the dead_tuples into a huge allocation. >> >> This results in fewer index scans for heavily bloated tables, and >> could be a lifesaver in many situations (in particular, the situation >> I'm living right now in production, where we don't have enough room >> for a vacuum full, and have just deleted 75% of a table to make room >> but have to rely on regular lazy vacuum to free the space). >> >> The patch also makes vacuum free the dead_tuples before starting >> truncation. It didn't seem necessary to hold onto it beyond that >> point, and it might help give the OS more cache, especially if work >> mem is configured very high to avoid multiple index scans. >> >> Tested with pgbench scale 4000 after deleting the whole >> pgbench_accounts table, seemed to work fine. > > The problem with this is that we allocate the entire amount of > maintenance_work_mem even when the number of actual dead tuples turns > out to be very small. That's not so bad if the amount of memory we're > potentially wasting is limited to ~1 GB, but it seems pretty dangerous > to remove the 1 GB limit, because somebody might have > maintenance_work_mem set to tens or hundreds of gigabytes to speed > index creation, and allocating that much space for a VACUUM that > encounters 1 dead tuple does not seem like a good plan. > > What I think we need to do is make some provision to initially > allocate only a small amount of memory and then grow the allocation > later if needed. For example, instead of having > vacrelstats->dead_tuples be declared as ItemPointer, declare it as > ItemPointer * and allocate the array progressively in segments. I'd > actually argue that the segment size should be substantially smaller > than 1 GB, like say 64MB; there are still some people running systems > which are small enough that allocating 1 GB when we may need only 6 > bytes can drive the system into OOM. This would however incur the cost of having to copy the whole GB-sized chunk every time it's expanded. It woudln't be cheap. I've monitored the vacuum as it runs and the OS doesn't map the whole block unless it's touched, which it isn't until dead tuples are found. 
Surely, if overcommit is disabled (as it should be), it could exhaust the virtual address space if set very high, but it wouldn't really use the memory unless it's needed, it would merely reserve it. To fix that, rather than repalloc the whole thing, dead_tuples would have to be an ItemPointer** of sorted chunks. That'd be a significantly more complex patch, but at least it wouldn't incur the memcpy. I could attempt that, but I don't see the difference between vacuum and create index in this case. Both could allocate a huge chunk of the virtual address space if maintenance_work_mem says so, both proportional to the size of the table. I can't see how that could take any DBA by surprise. A sensible compromise could be dividing maintenance_work_mem by autovacuum_max_workers when used in autovacuum, as is done for cost limits, to protect those that set both rather high.
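The autovacuum compromise mentioned at the end would be a small tweak where vac_work_mem is derived; a sketch of the idea only (the 1MB floor is arbitrary, not from any patch):

int    vac_work_mem;

if (IsAutoVacuumWorkerProcess())
{
    int    base = (autovacuum_work_mem != -1) ?
        autovacuum_work_mem : maintenance_work_mem;

    /* split the budget across workers, as is already done for cost limits */
    vac_work_mem = Max(base / autovacuum_max_workers, 1024);
}
else
    vac_work_mem = maintenance_work_mem;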
On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> The problem with this is that we allocate the entire amount of >> maintenance_work_mem even when the number of actual dead tuples turns >> out to be very small. That's not so bad if the amount of memory we're >> potentially wasting is limited to ~1 GB, but it seems pretty dangerous >> to remove the 1 GB limit, because somebody might have >> maintenance_work_mem set to tens or hundreds of gigabytes to speed >> index creation, and allocating that much space for a VACUUM that >> encounters 1 dead tuple does not seem like a good plan. >> >> What I think we need to do is make some provision to initially >> allocate only a small amount of memory and then grow the allocation >> later if needed. For example, instead of having >> vacrelstats->dead_tuples be declared as ItemPointer, declare it as >> ItemPointer * and allocate the array progressively in segments. I'd >> actually argue that the segment size should be substantially smaller >> than 1 GB, like say 64MB; there are still some people running systems >> which are small enough that allocating 1 GB when we may need only 6 >> bytes can drive the system into OOM. > > This would however incur the cost of having to copy the whole GB-sized > chunk every time it's expanded. It woudln't be cheap. No, I don't want to end up copying the whole array; that's what I meant by allocating it progressively in segments. Something like what you go on to propose. > I've monitored the vacuum as it runs and the OS doesn't map the whole > block unless it's touched, which it isn't until dead tuples are found. > Surely, if overcommit is disabled (as it should), it could exhaust the > virtual address space if set very high, but it wouldn't really use the > memory unless it's needed, it would merely reserve it. Yeah, but I've seen actual breakage from exactly this issue on customer systems even with the 1GB limit, and when we start allowing 100GB it's going to get a whole lot worse. > To fix that, rather than repalloc the whole thing, dead_tuples would > have to be an ItemPointer** of sorted chunks. That'd be a > significantly more complex patch, but at least it wouldn't incur the > memcpy. Right, this is what I had in mind. I don't think this is actually very complicated, because the way we use this array is really simple. We basically just keep appending to the array until we run out of space, and that's not very hard to implement with an array-of-arrays. The chunks are, in some sense, sorted, as you say, but you don't need to do qsort() or anything like that. You're just replacing a single flat array with a data structure that can be grown incrementally in fixed-size chunks. > I could attempt that, but I don't see the difference between > vacuum and create index in this case. Both could allocate a huge chunk > of the virtual address space if maintainance work mem says so, both > proportional to the size of the table. I can't see how that could take > any DBA by surprise. Really? CREATE INDEX isn't going to allocate more storage space than the size of the data actually being sorted, because tuplesort.c is smart about that kind of thing. But VACUUM will very happily allocate vastly more memory than the number of dead tuples. It is thankfully smart enough not to allocate more storage than the number of line pointers that could theoretically exist in a relation of the given size, but that only helps for very small relations. 
In a large relation that divergence between the amount of storage space that could theoretically be needed and the amount that is actually needed is likely to be extremely high. 1 TB relation = 2^27 blocks, each of which can contain MaxHeapTuplesPerPage dead line pointers. On my system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB, but the expected number of dead line pointers is much less than that. Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor and you're using the default of 0.2 (probably too high for such a large table), assuming there are about 60 tuples for page (which is what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7 billion tuples of which 1.5 billion would be dead, meaning we need about 9-10GB of space to store all of those dead tuples. Allocating as much as 218GB when we need 9-10GB is going to sting, and I don't see how you will get a comparable distortion with CREATE INDEX. I might be missing something, though. There's no real issue when there's only one process running on the system at a time. If the user set maintenance_work_mem to an amount of memory that he can't afford to pay even once, then that's simple misconfiguration and it's not really our problem. The issue is that when there are 3 or potentially more VACUUM processes running plus a CREATE INDEX or two at the same time. If you set maintenance_work_mem to a value that is large enough to make the CREATE INDEX run fast, now with your patch that is also going to cause each VACUUM process to gobble up lots of extra memory that it probably doesn't need, and now you may well start to get failures. I've seen this happen even with the current 1GB limit, though you need a pretty small system - e.g. 8GB RAM - for it to be a problem. I think it is really really likely to cause big problems for us if we dramatically increase that limit without making the allocation algorithm smarter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
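To make the array-of-arrays idea concrete, here is a rough sketch with hypothetical names (DeadTupleStore and friends are not from any actual patch): segments of a fixed size are allocated lazily as dead TIDs are recorded, so actual memory use tracks the number of dead tuples found rather than maintenance_work_mem:

#define DEAD_TUPLES_PER_SEG \
    ((64 * 1024 * 1024) / sizeof(ItemPointerData))    /* ~64MB per segment */

typedef struct DeadTupleStore
{
    int          nsegs;      /* segments allocated so far */
    int          maxsegs;    /* cap derived from maintenance_work_mem */
    long         ntuples;    /* total TIDs stored */
    ItemPointer *segs;       /* palloc'd array of maxsegs segment pointers */
} DeadTupleStore;

static void
dead_tuples_append(DeadTupleStore *store, ItemPointer itemptr)
{
    long    idx = store->ntuples % DEAD_TUPLES_PER_SEG;

    if (idx == 0)
    {
        /* current segment full (or none allocated yet): add the next one */
        Assert(store->nsegs < store->maxsegs);
        store->segs[store->nsegs++] = (ItemPointer)
            palloc(DEAD_TUPLES_PER_SEG * sizeof(ItemPointerData));
    }
    store->segs[store->nsegs - 1][idx] = *itemptr;
    store->ntuples++;
}

Because lazy_scan_heap() records TIDs in physical heap order, each segment stays sorted and segments never overlap, which is what keeps the lookup side simple (see the sketch further down the thread).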
On Tue, Sep 6, 2016 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I could attempt that, but I don't see the difference between >> vacuum and create index in this case. Both could allocate a huge chunk >> of the virtual address space if maintainance work mem says so, both >> proportional to the size of the table. I can't see how that could take >> any DBA by surprise. > > Really? CREATE INDEX isn't going to allocate more storage space than > the size of the data actually being sorted, because tuplesort.c is > smart about that kind of thing. But VACUUM will very happily allocate > vastly more memory than the number of dead tuples. It is thankfully > smart enough not to allocate more storage than the number of line > pointers that could theoretically exist in a relation of the given > size, but that only helps for very small relations. In a large > relation that divergence between the amount of storage space that > could theoretically be needed and the amount that is actually needed > is likely to be extremely high. 1 TB relation = 2^27 blocks, each of > which can contain MaxHeapTuplesPerPage dead line pointers. On my > system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible > dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB, > but the expected number of dead line pointers is much less than that. > Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor > and you're using the default of 0.2 (probably too high for such a > large table), assuming there are about 60 tuples for page (which is > what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7 > billion tuples of which 1.5 billion would be dead, meaning we need > about 9-10GB of space to store all of those dead tuples. Allocating > as much as 218GB when we need 9-10GB is going to sting, and I don't > see how you will get a comparable distortion with CREATE INDEX. I > might be missing something, though. CREATE INDEX could also allocate 218GB, you just need to index enough columns and you'll get that. Aside from the fact that CREATE INDEX will only allocate what is going to be used and VACUUM will overallocate, the potential to fully allocate the amount given is still there for both cases. > There's no real issue when there's only one process running on the > system at a time. If the user set maintenance_work_mem to an amount > of memory that he can't afford to pay even once, then that's simple > misconfiguration and it's not really our problem. The issue is that > when there are 3 or potentially more VACUUM processes running plus a > CREATE INDEX or two at the same time. If you set maintenance_work_mem > to a value that is large enough to make the CREATE INDEX run fast, now > with your patch that is also going to cause each VACUUM process to > gobble up lots of extra memory that it probably doesn't need, and now > you may well start to get failures. I've seen this happen even with > the current 1GB limit, though you need a pretty small system - e.g. > 8GB RAM - for it to be a problem. I think it is really really likely > to cause big problems for us if we dramatically increase that limit > without making the allocation algorithm smarter. Ok, a pity it will invalidate all the testing already done though (I was almost done with the testing). I guess I'll send the results anyway.
On Tue, Sep 6, 2016 at 11:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > CREATE INDEX could also allocate 218GB, you just need to index enough > columns and you'll get that. > > Aside from the fact that CREATE INDEX will only allocate what is going > to be used and VACUUM will overallocate, the potential to fully > allocate the amount given is still there for both cases. I agree with that, but I think there's a big difference between allocating the memory only when it's needed and allocating it whether it is needed or not. YMMV, of course, but that's what I think.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, but I've seen actual breakage from exactly this issue on > customer systems even with the 1GB limit, and when we start allowing > 100GB it's going to get a whole lot worse. While it's not necessarily a bad idea to consider these things, I think people are greatly overestimating the consequences of the patch-as-proposed. AFAICS, it does *not* let you tell VACUUM to eat 100GB of workspace. Note the line right in front of the one being changed:

        maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
        maxtuples = Min(maxtuples, INT_MAX);
-       maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+       maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));

Regardless of what vac_work_mem is, we aren't gonna let you have more than INT_MAX ItemPointers, hence 12GB at the most. So the worst-case increase from the patch as given is 12X. Maybe that's enough to cause bad consequences on some systems, but it's not the sort of disaster Robert posits above. It's also worth re-reading the lines just after this, which constrain the allocation a whole lot more for small tables. Robert comments: > ... But VACUUM will very happily allocate > vastly more memory than the number of dead tuples. It is thankfully > smart enough not to allocate more storage than the number of line > pointers that could theoretically exist in a relation of the given > size, but that only helps for very small relations. In a large > relation that divergence between the amount of storage space that > could theoretically be needed and the amount that is actually needed > is likely to be extremely high. 1 TB relation = 2^27 blocks, each of > which can contain MaxHeapTuplesPerPage dead line pointers. On my > system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible > dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB, > but the expected number of dead line pointers is much less than that. If we think the expected number of dead pointers is so much less than that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in extra index vacuum cycles when we're wrong? (Actually, what I'd be inclined to do is let it have MaxHeapTuplesPerPage slots per page up till a few meg, and then start tailing off the space-per-page, figuring that the law of large numbers will probably kick in.) regards, tom lane
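For readers without the source at hand, the "lines just after this" are approximately the following per-relation clamp in lazy_space_alloc() (quoted from memory, so treat it as illustrative rather than authoritative):

        /* curious coding here to ensure the multiplication can't overflow */
        if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
            maxtuples = relblocks * LAZY_ALLOC_TUPLES;

        /* stay sane if small maintenance_work_mem */
        maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);

With LAZY_ALLOC_TUPLES defined as MaxHeapTuplesPerPage, a relation of N blocks never reserves room for more than N * MaxHeapTuplesPerPage TIDs, which is the relation-size bound Robert refers to upthread.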
On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah, but I've seen actual breakage from exactly this issue on >> customer systems even with the 1GB limit, and when we start allowing >> 100GB it's going to get a whole lot worse. > > While it's not necessarily a bad idea to consider these things, > I think people are greatly overestimating the consequences of the > patch-as-proposed. AFAICS, it does *not* let you tell VACUUM to > eat 100GB of workspace. Note the line right in front of the one > being changed: > > maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); > maxtuples = Min(maxtuples, INT_MAX); > - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); > + maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData)); > > Regardless of what vac_work_mem is, we aren't gonna let you have more > than INT_MAX ItemPointers, hence 12GB at the most. So the worst-case > increase from the patch as given is 12X. Maybe that's enough to cause > bad consequences on some systems, but it's not the sort of disaster > Robert posits above. Is there a reason we can't use repalloc here? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 6, 2016 at 2:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah, but I've seen actual breakage from exactly this issue on >> customer systems even with the 1GB limit, and when we start allowing >> 100GB it's going to get a whole lot worse. > > While it's not necessarily a bad idea to consider these things, > I think people are greatly overestimating the consequences of the > patch-as-proposed. AFAICS, it does *not* let you tell VACUUM to > eat 100GB of workspace. Note the line right in front of the one > being changed: > > maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); > maxtuples = Min(maxtuples, INT_MAX); > - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); > + maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData)); > > Regardless of what vac_work_mem is, we aren't gonna let you have more > than INT_MAX ItemPointers, hence 12GB at the most. So the worst-case > increase from the patch as given is 12X. Maybe that's enough to cause > bad consequences on some systems, but it's not the sort of disaster > Robert posits above. Hmm, OK. Yes, that is a lot less bad. (I think it's still bad.) > If we think the expected number of dead pointers is so much less than > that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in > extra index vacuum cycles when we're wrong? Because that's really inefficient. Growing the array, even with a stupid approach that copies all of the TIDs every time, is a heck of a lot faster than incurring an extra index vac cycle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Yeah, but I've seen actual breakage from exactly this issue on >>> customer systems even with the 1GB limit, and when we start allowing >>> 100GB it's going to get a whole lot worse. >> >> While it's not necessarily a bad idea to consider these things, >> I think people are greatly overestimating the consequences of the >> patch-as-proposed. AFAICS, it does *not* let you tell VACUUM to >> eat 100GB of workspace. Note the line right in front of the one >> being changed: >> >> maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); >> maxtuples = Min(maxtuples, INT_MAX); >> - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); >> + maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData)); >> >> Regardless of what vac_work_mem is, we aren't gonna let you have more >> than INT_MAX ItemPointers, hence 12GB at the most. So the worst-case >> increase from the patch as given is 12X. Maybe that's enough to cause >> bad consequences on some systems, but it's not the sort of disaster >> Robert posits above. > > Is there a reason we can't use repalloc here? There are two possible problems, either of which is necessarily fatal: 1. I expect repalloc probably works by allocating the new space, copying from old to new, and freeing the old. That could work out badly if we are nearly the edge of the system's allocation limit. 2. It's slower than the approach proposed upthread of allocating the array in segments. With that approach, we never need to memcpy() anything. On the plus side, it's probably less code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 6, 2016 at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There are two possible problems, either of which is necessarily fatal: I meant to write "neither of which" not "either of which". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndquadrant.com> writes: > Is there a reason we can't use repalloc here? (1) repalloc will probably copy the data. (2) that answer doesn't excuse you from choosing a limit. We could get around (1) by something like Robert's idea of segmented allocation, but TBH I've seen nothing on this thread to make me think it's necessary or would even result in any performance improvement at all. The bigger we make that array, the worse index-cleaning is going to perform, and complicating the data structure will add another hit on top of that. regards, tom lane
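For comparison with the segmented approach, the repalloc variant being discussed would look roughly like this (a sketch only, assumed to sit in the code path that records a dead TID, with maxtuples as the configured cap); the copying objection is that repalloc_huge generally allocates the new block, copies the old contents over, and only then frees the old block, so both allocations coexist briefly:

if (vacrelstats->num_dead_tuples >= vacrelstats->max_dead_tuples &&
    vacrelstats->max_dead_tuples < maxtuples)
{
    long    newmax = Min((long) vacrelstats->max_dead_tuples * 2, maxtuples);

    vacrelstats->dead_tuples = (ItemPointer)
        repalloc_huge(vacrelstats->dead_tuples,
                      newmax * sizeof(ItemPointerData));
    vacrelstats->max_dead_tuples = (int) newmax;
}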
On Tue, Sep 6, 2016 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > We could get around (1) by something like Robert's idea of segmented > allocation, but TBH I've seen nothing on this thread to make me think > it's necessary or would even result in any performance improvement > at all. The bigger we make that array, the worse index-cleaning > is going to perform, and complicating the data structure will add > another hit on top of that. I wouldn't be so sure; I've seen cases where two binary searches were faster than a single binary search, especially when working with humongous arrays like this tid array, because touching fewer (memory) pages for a search does pay off considerably. I'd try it before giving up on the idea. The test results (which I'll post in a second) do lend credence to your expectation that making the array bigger/more complex does impact index scan performance. It's still faster than scanning several times though.
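To illustrate the two-binary-search point with the hypothetical DeadTupleStore sketched upthread: lazy_tid_reaped() would first locate the one segment whose range can contain the TID (a binary search over the segments' first entries) and then do an ordinary bsearch within that sorted segment, reusing the existing vac_cmp_itemptr() comparator. A sketch, not actual patch code:

static bool
dead_tuples_lookup(DeadTupleStore *store, ItemPointer itemptr)
{
    int     lo,
            hi;
    long    nthis;

    if (store->nsegs == 0)
        return false;

    /* first search: rightmost segment whose first TID is <= itemptr */
    lo = 0;
    hi = store->nsegs - 1;
    while (lo < hi)
    {
        int     mid = (lo + hi + 1) / 2;

        if (ItemPointerCompare(itemptr, &store->segs[mid][0]) >= 0)
            lo = mid;
        else
            hi = mid - 1;
    }

    /* second search: plain bsearch within that (sorted) segment */
    nthis = store->ntuples - (long) lo * (long) DEAD_TUPLES_PER_SEG;
    if (nthis > (long) DEAD_TUPLES_PER_SEG)
        nthis = (long) DEAD_TUPLES_PER_SEG;

    return bsearch(itemptr, store->segs[lo], nthis,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}

The first search only touches one TID per segment (a handful of cache lines even for a multi-GB store), which is where the locality argument above comes from; the second search is no worse than what lazy_tid_reaped() does today over a flat 1GB array.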
On 6 September 2016 at 19:09, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> Yeah, but I've seen actual breakage from exactly this issue on >>>> customer systems even with the 1GB limit, and when we start allowing >>>> 100GB it's going to get a whole lot worse. >>> >>> While it's not necessarily a bad idea to consider these things, >>> I think people are greatly overestimating the consequences of the >>> patch-as-proposed. AFAICS, it does *not* let you tell VACUUM to >>> eat 100GB of workspace. Note the line right in front of the one >>> being changed: >>> >>> maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); >>> maxtuples = Min(maxtuples, INT_MAX); >>> - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); >>> + maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData)); >>> >>> Regardless of what vac_work_mem is, we aren't gonna let you have more >>> than INT_MAX ItemPointers, hence 12GB at the most. So the worst-case >>> increase from the patch as given is 12X. Maybe that's enough to cause >>> bad consequences on some systems, but it's not the sort of disaster >>> Robert posits above. >> >> Is there a reason we can't use repalloc here? > > There are two possible problems, either of which is necessarily fatal: > > 1. I expect repalloc probably works by allocating the new space, > copying from old to new, and freeing the old. That could work out > badly if we are nearly the edge of the system's allocation limit. > > 2. It's slower than the approach proposed upthread of allocating the > array in segments. With that approach, we never need to memcpy() > anything. > > On the plus side, it's probably less code. Hmm, OK. What occurs to me is that we can exactly predict how many tuples we are going to get when we autovacuum, since we measure that and we know what the number is when we trigger it. So there doesn't need to be any guessing going on at all, nor do we need it to be flexible. My proposal now is to pass in the number of rows changed since last vacuum and use that (+10% to be safe) as the size of the array, up to the defined limit. Manual VACUUM still needs to guess, so we might need a flexible solution there, but generally we don't. We could probably estimate it from the VM. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > What occurs to me is that we can exactly predict how many tuples we > are going to get when we autovacuum, since we measure that and we know > what the number is when we trigger it. > > So there doesn't need to be any guessing going on at all, nor do we > need it to be flexible. No, that's not really true. A lot can change between the time it's triggered and the time it happens, or even while it's happening. Somebody can run a gigantic bulk delete just after we start the VACUUM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 September 2016 at 21:58, Claudio Freire <klaussfreire@gmail.com> wrote: > >>>>> How long does that part ever take? Is there any substantial gain from this? > >> Btw, without a further patch to prefetch pages on the backward scan >> for truncate, however, my patience ran out before it finished >> truncating. I haven't submitted that patch because there was an >> identical patch in an older thread that was discussed and more or less >> rejected since it slightly penalized SSDs. > > OK, thats enough context. Sorry for being forgetful on that point. > > Please post that new patch also.

Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote: >>>>> The patch also makes vacuum free the dead_tuples before starting >>>>> truncation. It didn't seem necessary to hold onto it beyond that >>>>> point, and it might help give the OS more cache, especially if work >>>>> mem is configured very high to avoid multiple index scans. >>>> >>>> How long does that part ever take? Is there any substantial gain from this? >>>> >>>> Lets discuss that as a potential second patch. >>> >>> In the test case I mentioned, it takes longer than the vacuum part itself. >> >> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the information below is applicable only to the old patch (to be rewritten). I'm sending this merely to document the testing done; it will be a while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned on the OP: > > - createdb pgbench > - pgbench -i -s 4000 pgbench > - psql pgbench -c 'delete from pgbench_accounts;' > - vacuumdb -v -t pgbench_accounts pgbench > > fsync=off, autovacuum=off, maintainance_work_mem=4GB > > From what I remember, it used ~2.7GB of RAM up until the truncate > phase, where it freed it. It performed a single index scan over the > PK. > > I don't remember timings, and I didn't take them, so I'll have to > repeat the test to get them. It takes all day and makes my laptop > unusably slow, so I'll post them later, but they're not very > interesting. The only interesting bit is that it does a single index > scan instead of several, which on TB-or-more tables it's kinda nice.

So, the test results below.

During setup (maybe useful as context), the delete took 52m 50s of real time (measured with time psql pgbench -c 'delete from pgbench_accounts;'). During the delete, my I/O was on average like the following, which should give an indication of what my I/O subsystem is capable of (not much, granted):

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.47 5.27 35.53 77.42 17.58 42.95 1097.51 145.22 1295.23 33.47 1874.36 8.85 100.00

Since it's a 5k RPM laptop drive, it's rather slow on IOPS, and since I'm using the defaults for shared buffers and checkpoints, write throughput isn't stellar either. But that's not the point of the test anyway, it's just for context.
The hardware is an HP envy laptop with a 1TB 5.4k RPM hard drive, 12GB RAM, core i7-4722HQ, no weird performance tweaking of any kind (ie: cpu scaling left intact). The system was not dedicated of course, being a laptop, but it had little else going on while the test was running. Given the size of the test, I don't believe there's any chance concurrent activity could invalidate the results. The timing for setup was comparable with both versions (patched and unpatched), so I'm reporting the patched times only.

The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 400000000 row versions
DETAIL: CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO: "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL: CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 400000000 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 34.88s/7.91u sec elapsed 1645.90 sec.

Total elapsed time: ~92 minutes

I/O during initial heap scan:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.52 99.78 72.63 62.47 31.94 33.22 987.80 146.71 1096.29 25.39 2341.48 7.40 100.00

Index scan:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 7.08 3.87 55.18 59.87 17.06 31.83 870.33 146.61 1243.34 31.42 2360.44 8.69 100.00

Final heap scan:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.78 8.65 65.32 57.32 31.50 32.96 1076.56 152.22 1928.67 1410.63 2519.01 8.15 100.00

Truncate (with prefetch):

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 159.67 0.87 3720.03 2.82 30.31 0.12 16.74 19.11 5.13 4.95 242.60 0.27 99.23

Without prefetch, rMB/s during truncation varies between 4MB/s and 6MB/s, so it's on average 6 times slower, meaning it would take over 3 hours.

Peak memory used: 2369MB RSS, 4260MB VIRT (source: top)

Unpatched + prefetch (same config, effective work mem 1GB due to non-huge allocation limit):

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 5.88s/53.77u sec elapsed 263.63 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.28s/12.94u sec elapsed 757.45 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 7.44s/31.28u sec elapsed 282.41 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.24s/13.30u sec elapsed 806.54 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 42086526 row versions
DETAIL: CPU 4.30s/5.83u sec elapsed 170.30 sec.
INFO: "pgbench_accounts": removed 42086526 row versions in 689944 pages
DETAIL: CPU 3.35s/3.23u sec elapsed 126.22 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages DETAIL: 400000000 index row versions were removed. 1096351 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.40 sec. INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages DETAIL: 0 dead row versions cannot be removed yet. There were 0 unused item pointers. Skipped 0 pages due to buffer pins. 0 pages are entirely empty. CPU 123.82s/183.76u sec elapsed 4071.54 sec. INFO: "pgbench_accounts": truncated 6557378 to 0 pages DETAIL: CPU 40.36s/7.72u sec elapsed 1648.22 sec. Total elapsed time: ~95m I/O during initial heap scan: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 1.48 32.53 66.10 60.50 31.95 34.88 1081.06 149.20 1175.78 25.44 2432.59 8.02 101.59 First index scan: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 1.17 14.95 43.85 70.07 19.65 40.18 1075.57 145.98 1278.39 31.86 2058.51 8.78 100.00 Final index scan: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 17.12 1.50 169.85 2.28 68.33 0.67 820.95 158.32 312.00 28.14 21426.95 5.81 100.00 Truncation: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 142.93 1.23 3444.70 4.65 28.16 0.36 16.93 18.52 5.37 5.25 91.17 0.29 99.22 Peak memory used is 1135MB RSS and 1188MB VIRT Comparing: Time reportedly spent scanning indexes: 716.34 unpatched, 566.47 patched Time reportedly spent scanning heap: 1690.21 unpatched, 1872.76 patched Total vacuum scan as reported: 4071.54 unpatched, 3877.13 patched Surely I didn't expect it to be such a close call. I believe the key reason is the speedup it got during the final index scan for not having to delete so many tuples. Clearly, having to interleave reads and writes is stressing my HD, and the last index scan, having to write less, was thus faster. I don't believe this would have happened if the index hadn't been pristine and in almost physical (heap) order, so I'd expect real-world cases (with properly aged, shuffled and bloated indexes) to show a more pronounced difference. Or when using a cost limit, that will artificially limit the I/O rate vacuum can reach. Clearly the patch is of use when I/O is the limiting factor, either due to vacuum cost limits, or due to the I/O subsystem being the bottleneck, as was the case during the above test case. Since more work mem will mean a slower lookup of the dead_tuples array, not only due to the extra comparisons but also poorer cache locality, I believe it won't benefit the runtime cost of CPU-bound cases, but it should at least generate less WAL since that's another benefit of scanning the indexes fewer times (increased WAL rates during vacuum is another problem we regularly face in our production setup). Given the I/O subsystem on my test machine isn't able to produce a CPU-bound test case for the amount of dead_tuples involved in stressing the patch, I cannot confirm the above statement, but it should be evident given the implementation.
Attachment
On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> What occurs to me is that we can exactly predict how many tuples we >> are going to get when we autovacuum, since we measure that and we know >> what the number is when we trigger it. >> >> So there doesn't need to be any guessing going on at all, nor do we >> need it to be flexible. > > No, that's not really true. A lot can change between the time it's > triggered and the time it happens, or even while it's happening. > Somebody can run a gigantic bulk delete just after we start the > VACUUM. Which wouldn't be removed by the VACUUM, so can be ignored. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 6, 2016 at 2:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> What occurs to me is that we can exactly predict how many tuples we >>> are going to get when we autovacuum, since we measure that and we know >>> what the number is when we trigger it. >>> >>> So there doesn't need to be any guessing going on at all, nor do we >>> need it to be flexible. >> >> No, that's not really true. A lot can change between the time it's >> triggered and the time it happens, or even while it's happening. >> Somebody can run a gigantic bulk delete just after we start the >> VACUUM. > > Which wouldn't be removed by the VACUUM, so can be ignored. OK, true. But I still think it's very unlikely that we can calculate an exact count of how many dead tuples we might run into. I think we shouldn't rely on the stats collector to be perfectly correct anyway - for one thing, you can turn it off - and instead cope with the uncertainty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndquadrant.com> writes: > On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> What occurs to me is that we can exactly predict how many tuples we >>> are going to get when we autovacuum, since we measure that and we know >>> what the number is when we trigger it. >>> So there doesn't need to be any guessing going on at all, nor do we >>> need it to be flexible. >> No, that's not really true. A lot can change between the time it's >> triggered and the time it happens, or even while it's happening. >> Somebody can run a gigantic bulk delete just after we start the >> VACUUM. > Which wouldn't be removed by the VACUUM, so can be ignored. (1) If the delete commits just before the vacuum starts, it may be removable. I think you're nuts to imagine there are no race conditions here. (2) Stats from the stats collector never have been, and likely never will be, anything but approximate. That goes double for dead-tuple counts, which are inaccurate even as sent from backends, never mind the multiple ways that the collector might lose the counts. The idea of looking to the stats to *guess* about how many tuples are removable doesn't seem bad at all. But imagining that that's going to be exact is folly of the first magnitude. regards, tom lane
On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The idea of looking to the stats to *guess* about how many tuples are > removable doesn't seem bad at all. But imagining that that's going to be > exact is folly of the first magnitude. Yes. Bear in mind I had already referred to allowing +10% to be safe, so I think we agree that a reasonably accurate, yet imprecise calculation is possible in most cases. If a recent transaction has committed, we will see both committed dead rows and stats to show they exist. I'm sure there are corner cases and race conditions where a major effect (greater than 10%) could occur, in which case we run the index scan more than once, just as we do now. The attached patch raises the limits as suggested by Claudio, allowing for larger memory allocations if possible, yet limits the allocation for larger tables based on the estimate gained from pg_stats, while adding 10% for caution. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
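Without presuming anything about the attached patch's actual contents, the stats-based cap described above amounts to something along these lines inside lazy_space_alloc() (a sketch only; the 10% slack and the fallback when no stats are available are the interesting policy knobs):

PgStat_StatTabEntry *tabentry;

tabentry = pgstat_fetch_stat_tabentry(RelationGetRelid(onerel));
if (tabentry != NULL && tabentry->n_dead_tuples > 0)
{
    /* believe the stats collector's dead-tuple count, plus 10% for caution */
    long    estimate = (long) (tabentry->n_dead_tuples * 1.1) + 1;

    maxtuples = Min(maxtuples, Max(estimate, MaxHeapTuplesPerPage));
}
/* if the stats collector is off or has no entry, keep the existing cap */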
On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> The idea of looking to the stats to *guess* about how many tuples are >> removable doesn't seem bad at all. But imagining that that's going to be >> exact is folly of the first magnitude. > > Yes. Bear in mind I had already referred to allowing +10% to be safe, > so I think we agree that a reasonably accurate, yet imprecise > calculation is possible in most cases. That would all be well and good if it weren't trivial to do what Robert suggested. This is just a large unsorted list that we need to iterate through. Just allocate chunks of a few megabytes and when one fills up allocate a new chunk and keep going. There's no need to get tricky with estimates and resizing and whatever. -- greg
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote: > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >>> The idea of looking to the stats to *guess* about how many tuples are >>> removable doesn't seem bad at all. But imagining that that's going to be >>> exact is folly of the first magnitude. >> >> Yes. Bear in mind I had already referred to allowing +10% to be safe, >> so I think we agree that a reasonably accurate, yet imprecise >> calculation is possible in most cases. > > That would all be well and good if it weren't trivial to do what > Robert suggested. This is just a large unsorted list that we need to > iterate throught. Just allocate chunks of a few megabytes and when > it's full allocate a new chunk and keep going. There's no need to get > tricky with estimates and resizing and whatever. I agree. While the idea of estimating the right size sounds promising a priori, considering the estimate can go wrong and over or underallocate quite severely, the risks outweigh the benefits when you consider the alternative of a dynamic allocation strategy. Unless the dynamic strategy has a bigger CPU impact than expected, I believe it's a superior approach.
On Wed, Sep 7, 2016 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> The problem with this is that we allocate the entire amount of >>> maintenance_work_mem even when the number of actual dead tuples turns >>> out to be very small. That's not so bad if the amount of memory we're >>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous >>> to remove the 1 GB limit, because somebody might have >>> maintenance_work_mem set to tens or hundreds of gigabytes to speed >>> index creation, and allocating that much space for a VACUUM that >>> encounters 1 dead tuple does not seem like a good plan. >>> >>> What I think we need to do is make some provision to initially >>> allocate only a small amount of memory and then grow the allocation >>> later if needed. For example, instead of having >>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as >>> ItemPointer * and allocate the array progressively in segments. I'd >>> actually argue that the segment size should be substantially smaller >>> than 1 GB, like say 64MB; there are still some people running systems >>> which are small enough that allocating 1 GB when we may need only 6 >>> bytes can drive the system into OOM. >> >> This would however incur the cost of having to copy the whole GB-sized >> chunk every time it's expanded. It woudln't be cheap. > > No, I don't want to end up copying the whole array; that's what I > meant by allocating it progressively in segments. Something like what > you go on to propose. > >> I've monitored the vacuum as it runs and the OS doesn't map the whole >> block unless it's touched, which it isn't until dead tuples are found. >> Surely, if overcommit is disabled (as it should), it could exhaust the >> virtual address space if set very high, but it wouldn't really use the >> memory unless it's needed, it would merely reserve it. > > Yeah, but I've seen actual breakage from exactly this issue on > customer systems even with the 1GB limit, and when we start allowing > 100GB it's going to get a whole lot worse. > >> To fix that, rather than repalloc the whole thing, dead_tuples would >> have to be an ItemPointer** of sorted chunks. That'd be a >> significantly more complex patch, but at least it wouldn't incur the >> memcpy. > > Right, this is what I had in mind. I don't think this is actually > very complicated, because the way we use this array is really simple. > We basically just keep appending to the array until we run out of > space, and that's not very hard to implement with an array-of-arrays. > The chunks are, in some sense, sorted, as you say, but you don't need > to do qsort() or anything like that. You're just replacing a single > flat array with a data structure that can be grown incrementally in > fixed-size chunks. > If we replaced dead_tuples with an array-of-array, isn't there negative performance impact for lazy_tid_reap()? As chunk is added, that performance would be decrease. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 9/8/16 3:48 AM, Masahiko Sawada wrote: > If we replaced dead_tuples with an array-of-array, isn't there > negative performance impact for lazy_tid_reap()? > As chunk is added, that performance would be decrease. Yes, it certainly would, as you'd have to do 2 binary searches. I'm not sure how much that matters though; presumably the index scans are normally IO-bound? Another option would be to use the size estimation ideas others have mentioned to create one array. If the estimates prove to be wrong you could then create a single additional segment; by that point you should have a better idea of how far off the original estimate was. That means the added search cost would only be a compare and a second pointer redirect. Something else that occurred to me... AFAIK the only reason we don't support syncscan with VACUUM is because it would require sorting the TID list. If we just added a second TID list we would be able to support syncscan, swapping over to the 'low' list when we hit the end of the relation. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> The idea of looking to the stats to *guess* about how many tuples are
>>> removable doesn't seem bad at all. But imagining that that's going to be
>>> exact is folly of the first magnitude.
>>
>> Yes. Bear in mind I had already referred to allowing +10% to be safe,
>> so I think we agree that a reasonably accurate, yet imprecise
>> calculation is possible in most cases.
>
> That would all be well and good if it weren't trivial to do what
> Robert suggested. This is just a large unsorted list that we need to
> iterate throught. Just allocate chunks of a few megabytes and when
> it's full allocate a new chunk and keep going. There's no need to get
> tricky with estimates and resizing and whatever.
I agree. While the idea of estimating the right size sounds promising
a priori, considering the estimate can go wrong and over or
underallocate quite severely, the risks outweigh the benefits when you
consider the alternative of a dynamic allocation strategy.
Unless the dynamic strategy has a bigger CPU impact than expected, I
believe it's a superior approach.
How about a completely different representation for the TID array? This is probably not something new, but I couldn't find whether the exact same idea has been discussed before. I also think it's somewhat orthogonal to what we are trying to do here, and will probably be a bigger change. But I thought I'd mention it since we are on the topic.
What I propose is to use a simple bitmap to represent the tuples. If a tuple at <block, offset> is dead then the corresponding bit in the bitmap is set. So clearly the searches through dead tuples are O(1) operations, important for very large tables and large arrays.
The real challenge is that a heap page can theoretically have MaxOffsetNumber line pointers (or, to be precise, the maximum possible offset number). For an 8K block, that comes to about 2048. Having that many bits per page is neither practical nor optimal. But in practice the largest offset on a heap page should not be significantly greater than MaxHeapTuplesPerPage, which is a more reasonable value of 291 on my machine. Again, that's with zero-sized tuples; for real-life large tables, with much wider tuples, the number may go down even further.
So we cap the offsets represented in the bitmap at some realistic value, computed by looking at page density and then multiplying it by a small factor (not more than two) to take LP_DEAD and LP_REDIRECT line pointers into account. That should cover the majority of the dead tuples in the table, and we then keep an overflow area to record tuples beyond the per-page limit. The search routine does a direct lookup for offsets below the limit and searches the sorted overflow area for offsets beyond it.
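Just to make the lookup concrete, a rough standalone sketch (made-up names and layout, not actual code): offsets at or below the per-page cap are answered straight from the bitmap, and anything above the cap falls back to a binary search of the sorted overflow area.

#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct OverflowTid
{
    uint32_t    block;
    uint16_t    offset;
} OverflowTid;

typedef struct DeadTupleMap
{
    uint32_t     bits_per_page;  /* the per-page cap, e.g. 2x average tuple density */
    uint8_t     *bitmap;         /* nblocks * bits_per_page bits; bit set => dead */
    OverflowTid *overflow;       /* sorted; only offsets above the cap land here */
    size_t       noverflow;
} DeadTupleMap;

static int
overflow_cmp(const void *a, const void *b)
{
    const OverflowTid *x = a;
    const OverflowTid *y = b;

    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return (int) x->offset - (int) y->offset;
}

/* Is tuple (block, offset) recorded as dead?  Offsets are 1-based, as on heap pages. */
static bool
tuple_is_dead(const DeadTupleMap *map, uint32_t block, uint16_t offset)
{
    if (offset <= map->bits_per_page)
    {
        uint64_t    bit = (uint64_t) block * map->bits_per_page + (offset - 1);

        return (map->bitmap[bit / 8] >> (bit % 8)) & 1;
    }
    else
    {
        OverflowTid key = {block, offset};

        return bsearch(&key, map->overflow, map->noverflow,
                       sizeof(OverflowTid), overflow_cmp) != NULL;
    }
}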
For example, for a table with 60-byte-wide tuples (including the 24-byte header), each page can hold approximately 8192/60 = 136 tuples. Say we provision for 136*2 = 272 bits per page, i.e. 34 bytes per page for the bitmap. The first 272 offsets in every page are represented in the bitmap, and anything greater goes to the overflow region. On the other hand, the current representation will need about 16 bytes per page assuming 2% dead tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead tuples. So the bitmap will take more space for small tuples or when vacuum is run very aggressively, both of which seem unlikely for very large tables. Of course the calculation does not take into account the space needed by the overflow area, but I expect that to be small.
I guess we can make a choice between two representations at the start looking at various table stats. We can also be smart and change from bitmap to traditional representation as we scan the table and see many more tuples in the overflow region than we provisioned for. There will be some challenges in converting representation mid-way, especially in terms of memory allocation, but I think those can be sorted out if we think that the idea has merit.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > For example, for a table with 60 bytes wide tuple (including 24 byte > header), each page can approximately have 8192/60 = 136 tuples. Say we > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the > bitmap. First 272 offsets in every page are represented in the bitmap and > anything greater than are in overflow region. On the other hand, the current > representation will need about 16 bytes per page assuming 2% dead tuples, 40 > bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead > tuples. So bitmap will take more space for small tuples or when vacuum is > run very aggressively, both seems unlikely for very large tables. Of course > the calculation does not take into account the space needed by the overflow > area, but I expect that too be small. I thought about something like this, but it could be extremely inefficient for mostly frozen tables, since the bitmap cannot account for frozen pages without losing the O(1) lookup characteristic
On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.
I thought about something like this, but it could be extremely
inefficient for mostly frozen tables, since the bitmap cannot account
for frozen pages without losing the O(1) lookup characteristic
Well, that's correct. But I thought the whole point is the case where there are a large number of dead tuples, which requires a lot of memory. If my math above was correct, then even at 5% dead tuples the bitmap representation will consume approximately the same memory but provide O(1) search time.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > > On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire@gmail.com> > wrote: >> >> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote: >> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> >> > wrote: >> >> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> >> >>> The idea of looking to the stats to *guess* about how many tuples are >> >>> removable doesn't seem bad at all. But imagining that that's going to >> >>> be >> >>> exact is folly of the first magnitude. >> >> >> >> Yes. Bear in mind I had already referred to allowing +10% to be safe, >> >> so I think we agree that a reasonably accurate, yet imprecise >> >> calculation is possible in most cases. >> > >> > That would all be well and good if it weren't trivial to do what >> > Robert suggested. This is just a large unsorted list that we need to >> > iterate throught. Just allocate chunks of a few megabytes and when >> > it's full allocate a new chunk and keep going. There's no need to get >> > tricky with estimates and resizing and whatever. >> >> I agree. While the idea of estimating the right size sounds promising >> a priori, considering the estimate can go wrong and over or >> underallocate quite severely, the risks outweigh the benefits when you >> consider the alternative of a dynamic allocation strategy. >> >> Unless the dynamic strategy has a bigger CPU impact than expected, I >> believe it's a superior approach. >> > > How about a completely different representation for the TID array? Now this > is probably not something new, but I couldn't find if the exact same idea > was discussed before. I also think it's somewhat orthogonal to what we are > trying to do here, and will probably be a bigger change. But I thought I'll > mention since we are at the topic. > > What I propose is to use a simple bitmap to represent the tuples. If a tuple > at <block, offset> is dead then the corresponding bit in the bitmap is set. > So clearly the searches through dead tuples are O(1) operations, important > for very large tables and large arrays. > > Challenge really is that a heap page can theoretically have MaxOffsetNumber > of line pointers (or to be precise maximum possible offset number). For a 8K > block, that comes be about 2048. Having so many bits per page is neither > practical nor optimal. But in practice the largest offset on a heap page > should not be significantly greater than MaxHeapTuplesPerPage, which is a > more reasonable value of 291 on my machine. Again, that's with zero sized > tuple and for real life large tables, with much wider tuples, the number may > go down even further. > > So we cap the offsets represented in the bitmap to some realistic value, > computed by looking at page density and then multiplying it by a small > factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line > pointers. That should practically represent majority of the dead tuples in > the table, but we then keep an overflow area to record tuples beyond the > limit set for per page. The search routine will do a direct lookup for > offsets less than the limit and search in the sorted overflow area for > offsets beyond the limit. > > For example, for a table with 60 bytes wide tuple (including 24 byte > header), each page can approximately have 8192/60 = 136 tuples. Say we > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the > bitmap. 
First 272 offsets in every page are represented in the bitmap and > anything greater than are in overflow region. On the other hand, the current > representation will need about 16 bytes per page assuming 2% dead tuples, 40 > bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead > tuples. So bitmap will take more space for small tuples or when vacuum is > run very aggressively, both seems unlikely for very large tables. Of course > the calculation does not take into account the space needed by the overflow > area, but I expect that too be small. > > I guess we can make a choice between two representations at the start > looking at various table stats. We can also be smart and change from bitmap > to traditional representation as we scan the table and see many more tuples > in the overflow region than we provisioned for. There will be some > challenges in converting representation mid-way, especially in terms of > memory allocation, but I think those can be sorted out if we think that the > idea has merit. > Making the vacuum possible to choose between two data representations sounds good. I implemented the patch that changes dead tuple representation to bitmap before. I will measure the performance of bitmap representation again and post them. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Making the vacuum possible to choose between two data representations
sounds good.
I implemented the patch that changes dead tuple representation to bitmap before.
I will measure the performance of bitmap representation again and post them.
Sounds great! I haven't seen your patch, but what I would suggest is to compute the page density (D) = relpages/(dead+live tuples) and experiment with bitmaps of size D to 2D bits per page. May I also suggest that, instead of putting effort into implementing the overflow area, you just count how many dead TIDs would fall into the overflow area for a given choice of bitmap size.
It might be a good idea to experiment with different vacuum scale factors, varying from 2% to 20% (maybe 2, 5, 10, 20). You can probably run a longish pgbench test on a large table and then save the data directory for repeated experiments, although I'm not sure pgbench will be a good choice because HOT will prevent accumulation of dead pointers, in which case you may try adding another index on the abalance column.
It'll be worth measuring memory consumption of both representations as well as the performance implications for index vacuum. I don't expect to see any major difference in the heap scans for either representation.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 9, 2016 at 12:33 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > > On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> >> >> Making the vacuum possible to choose between two data representations >> sounds good. >> I implemented the patch that changes dead tuple representation to bitmap >> before. >> I will measure the performance of bitmap representation again and post >> them. > > > Sounds great! I haven't seen your patch, but what I would suggest is to > compute page density (D) = relpages/(dead+live tuples) and experiment with > bitmap of sizes of D to 2D bits per page. May I also suggest that instead of > putting in efforts in implementing the overflow area, just count how many > dead TIDs would fall under overflow area for a given choice of bitmap size. > Isn't that formula "page density (D) = (dead+live tuples)/relpages"? > It might be a good idea to experiment with different vacuum scale factor, > varying between 2% to 20% (may be 2, 5, 10, 20). You can probably run a > longish pgbench test on a large table and then save the data directory for > repeated experiments, although I'm not sure if pgbench will be a good choice > because HOT will prevent accumulation of dead pointers, in which case you > may try adding another index on abalance column. Thank you, I will experiment with this. > > It'll be worth measuring memory consumption of both representations as well > as performance implications on index vacuum. I don't expect to see any major > difference in either heap scans. > Yeah, it would be effective for the index vacuum speed and the number of execution of index vacuum. Attached PoC patch changes the representation of dead tuple locations to the hashmap having tuple bitmap. The one hashmap entry consists of the block number and the TID bitmap of corresponding block, and the block number is the hash key of hashmap. Current implementation of this patch is not smart yet because each hashmap entry allocates the tuple bitmap with fixed size(LAZY_ALLOC_TUPLES), so each hashentry can store up to LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples. In case where one block can store only the several tens tuples, the most bits are would be waste. After improved this patch as you suggested, I will measure performance benefit. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
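The shape being described is roughly the following (a simplified sketch with made-up names, not the attached patch; the hash table machinery itself is omitted):

#include <stdint.h>

/* Matches LAZY_ALLOC_TUPLES / MaxHeapTuplesPerPage for 8kB blocks. */
#define MAX_TUPLES_PER_PAGE 291

/*
 * One hash entry per heap block that has at least one dead tuple; the block
 * number is the hash key.  The bitmap is always sized for the worst case,
 * which is where the wasted bits mentioned above come from.
 */
typedef struct DeadTupleEntry
{
    uint32_t    blkno;                                  /* hash key */
    uint8_t     dead[(MAX_TUPLES_PER_PAGE + 7) / 8];    /* bit set => offset is dead */
} DeadTupleEntry;

/* Record offset (1-based) as dead. */
static inline void
entry_mark_dead(DeadTupleEntry *entry, uint16_t offset)
{
    entry->dead[(offset - 1) / 8] |= (uint8_t) (1 << ((offset - 1) % 8));
}

/* Probe used during index vacuum: O(1) once the entry for blkno is found. */
static inline int
entry_offset_is_dead(const DeadTupleEntry *entry, uint16_t offset)
{
    return (entry->dead[(offset - 1) / 8] >> ((offset - 1) % 8)) & 1;
}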
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached PoC patch changes the representation of dead tuple locations > to the hashmap having tuple bitmap. > The one hashmap entry consists of the block number and the TID bitmap > of corresponding block, and the block number is the hash key of > hashmap. > Current implementation of this patch is not smart yet because each > hashmap entry allocates the tuple bitmap with fixed > size(LAZY_ALLOC_TUPLES), so each hashentry can store up to > LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples. > In case where one block can store only the several tens tuples, the > most bits are would be waste. > > After improved this patch as you suggested, I will measure performance benefit. We also need to consider the amount of memory gets used. What I proposed - replacing the array with an array of arrays - would not increase memory utilization significantly. I don't think it would have much impact on CPU utilization either. It would require replacing the call to bsearch() in lazy_heap_reaptid() with an open-coded implementation of bsearch, or with one bsearch to find the chunk and another to find the TID within the chunk, but that shouldn't be very expensive. For one thing, if the array chunks are around the size I proposed (64MB), you've got more than ten million tuples per chunk, so you can't have very many chunks unless your table is both really large and possessed of quite a bit of dead stuff. Now, if I'm reading it correctly, this patch allocates a 132-byte structure for every page with at least one dead tuple. In the worst case where there's just one dead tuple per page, that's a 20x regression in memory usage. Actually, it's more like 40x, because AllocSetAlloc rounds small allocation sizes up to the next-higher power of two, which really stings for a 132-byte allocation, and then adds a 16-byte header to each chunk. But even 20x is clearly not good. There are going to be lots of real-world cases where this uses way more memory to track the same number of dead tuples, and I'm guessing that benchmarking is going to reveal that it's not faster, either. I think it's probably wrong to worry that an array-of-arrays is going to be meaningfully slower than a single array here. It's basically costing you some small number of additional memory references per tuple, which I suspect isn't all that relevant for a bulk operation that does I/O, writes WAL, locks buffers, etc. But if it is relevant, then I think there are other ways to buy that performance back which are likely to be more memory efficient than converting this to use a hash table. For example, we could keep a bitmap with one bit per K pages. If the bit is set, there is at least 1 dead tuple on that page; if clear, there are none. When we see an index tuple, we consult the bitmap to determine whether we need to search the TID list. We select K to be the smallest power of 2 such that the bitmap uses less memory than some threshold, perhaps 64kB. Assuming that updates and deletes to the table have some locality, we should be able to skip a large percentage of the TID searches with a probe into this very compact bitmap. Note that we can set K = 1 for tables up to 4GB in size, and even a 1TB table only needs K = 256. Odds are very good that a 1TB table being vacuumed has many 256-page ranges containing no dead tuples at all ... 
and if this proves to be false and the dead tuples are scattered uniformly throughout the table, then you should probably be more worried about the fact that you're dumping a bare minimum of 4GB of random I/O on your hapless disk controller than about how efficient the TID search is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
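The coarse filter described here might look roughly like this (a standalone sketch under the stated 64kB budget, made-up names):

#include <string.h>
#include <stdint.h>
#include <stdbool.h>

#define PREFILTER_MAX_BYTES (64 * 1024)     /* memory budget for the coarse bitmap */

typedef struct RangeFilter
{
    uint32_t    pages_per_bit;              /* K: a power of two */
    uint8_t     bits[PREFILTER_MAX_BYTES];
} RangeFilter;

/* Choose K = smallest power of two such that nblocks/K bits fit in the budget. */
static void
range_filter_init(RangeFilter *f, uint32_t nblocks)
{
    f->pages_per_bit = 1;
    while ((uint64_t) nblocks / f->pages_per_bit > (uint64_t) PREFILTER_MAX_BYTES * 8)
        f->pages_per_bit *= 2;
    memset(f->bits, 0, sizeof(f->bits));
}

/* Called whenever a dead tuple is recorded on heap block blkno. */
static void
range_filter_set(RangeFilter *f, uint32_t blkno)
{
    uint32_t    bit = blkno / f->pages_per_bit;

    f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

/* Probe before searching the TID list; false means no dead tuples in this page range. */
static bool
range_filter_may_contain(const RangeFilter *f, uint32_t blkno)
{
    uint32_t    bit = blkno / f->pages_per_bit;

    return (f->bits[bit / 8] >> (bit % 8)) & 1;
}

With tables up to 4GB this picks K = 1, and a 1TB table gets K = 256, matching the numbers above.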
On Tue, Sep 13, 2016 at 11:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think it's probably wrong to worry that an array-of-arrays is going > to be meaningfully slower than a single array here. It's basically > costing you some small number of additional memory references per > tuple, which I suspect isn't all that relevant for a bulk operation > that does I/O, writes WAL, locks buffers, etc. This analysis makes perfect sense to me. -- Peter Geoghegan
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached PoC patch changes the representation of dead tuple locations >> to the hashmap having tuple bitmap. >> The one hashmap entry consists of the block number and the TID bitmap >> of corresponding block, and the block number is the hash key of >> hashmap. >> Current implementation of this patch is not smart yet because each >> hashmap entry allocates the tuple bitmap with fixed >> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to >> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples. >> In case where one block can store only the several tens tuples, the >> most bits are would be waste. >> >> After improved this patch as you suggested, I will measure performance benefit. > > We also need to consider the amount of memory gets used. What I > proposed - replacing the array with an array of arrays - would not > increase memory utilization significantly. I don't think it would > have much impact on CPU utilization either. I've finished writing that patch, I'm in the process of testing its CPU impact. First test seemed to hint at a 40% increase in CPU usage, which seems rather steep compared to what I expected, so I'm trying to rule out some methodology error here. > It would require > replacing the call to bsearch() in lazy_heap_reaptid() with an > open-coded implementation of bsearch, or with one bsearch to find the > chunk and another to find the TID within the chunk, but that shouldn't > be very expensive. I did a linear search to find the chunk, with exponentially growing chunks, and then a bsearch to find the item inside the chunk. With the typical number of segments and given the 12GB limit, the segment array size is well within the range that favors linear search. > For example, we could keep a bitmap with one bit per K > pages. If the bit is set, there is at least 1 dead tuple on that > page; if clear, there are none. When we see an index tuple, we > consult the bitmap to determine whether we need to search the TID > list. We select K to be the smallest power of 2 such that the bitmap > uses less memory than some threshold, perhaps 64kB. I've been pondering something like that, but that's an optimization that's quite orthogonal to the multiarray stuff. > Assuming that > updates and deletes to the table have some locality, we should be able > to skip a large percentage of the TID searches with a probe into this > very compact bitmap. I don't think you can assume locality
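That two-level lookup is shaped roughly like this (a simplified standalone sketch with hypothetical names, not the actual patch): a linear scan over each segment's last TID finds the only segment that can contain the key, then bsearch() runs within that segment.

#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct DeadTid
{
    uint32_t    block;
    uint16_t    offset;
} DeadTid;

typedef struct DeadTidSeg
{
    DeadTid    *tids;       /* sorted in TID order, since the heap is scanned in order */
    size_t      ntids;
    DeadTid     last;       /* copy of tids[ntids - 1], keeps the linear scan cheap */
} DeadTidSeg;

static int
dead_tid_cmp(const void *a, const void *b)
{
    const DeadTid *x = a;
    const DeadTid *y = b;

    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/* Would this heap TID be removed?  Linear scan over segments, bsearch inside one. */
static bool
dead_tid_lookup(const DeadTidSeg *segs, int nsegs, DeadTid key)
{
    for (int i = 0; i < nsegs; i++)
    {
        if (dead_tid_cmp(&key, &segs[i].last) <= 0)
            return bsearch(&key, segs[i].tids, segs[i].ntids,
                           sizeof(DeadTid), dead_tid_cmp) != NULL;
    }
    return false;
}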
On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > I've finished writing that patch, I'm in the process of testing its CPU impact. > > First test seemed to hint at a 40% increase in CPU usage, which seems > rather steep compared to what I expected, so I'm trying to rule out > some methodology error here. Hmm, wow. That's pretty steep. Maybe lazy_heap_reaptid() is hotter than I think it is, but even if it accounts for 10% of total CPU usage within a vacuum, which seems like an awful lot, you'd have to make it 4x as expensive, which also seems like an awful lot. >> It would require >> replacing the call to bsearch() in lazy_heap_reaptid() with an >> open-coded implementation of bsearch, or with one bsearch to find the >> chunk and another to find the TID within the chunk, but that shouldn't >> be very expensive. > > I did a linear search to find the chunk, with exponentially growing > chunks, and then a bsearch to find the item inside the chunk. > > With the typical number of segments and given the 12GB limit, the > segment array size is well within the range that favors linear search. Ah, OK. >> For example, we could keep a bitmap with one bit per K >> pages. If the bit is set, there is at least 1 dead tuple on that >> page; if clear, there are none. When we see an index tuple, we >> consult the bitmap to determine whether we need to search the TID >> list. We select K to be the smallest power of 2 such that the bitmap >> uses less memory than some threshold, perhaps 64kB. > > I've been pondering something like that, but that's an optimization > that's quite orthogonal to the multiarray stuff. Sure, but if this really does increase CPU time, it'd be reasonable to do something to decrease it again in order to get the other benefits of this patch - i.e. increasing the maintenance_work_mem limit while reducing the chances that overallocation will cause OOM. >> Assuming that >> updates and deletes to the table have some locality, we should be able >> to skip a large percentage of the TID searches with a probe into this >> very compact bitmap. > > I don't think you can assume locality Really? If you have a 1TB table, how many 2MB ranges of that table do you think will contain dead tuples for a typical vacuum? I think most tables of that size are going to be mostly static, and the all-visible and all-frozen bits are going to be mostly set. You *could* have something like a pgbench-type workload that does scattered updates across the entire table, but that's going to perform pretty poorly because you'll constantly be updating blocks that have to be pulled in from disk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I've finished writing that patch, I'm in the process of testing its CPU impact. >> >> First test seemed to hint at a 40% increase in CPU usage, which seems >> rather steep compared to what I expected, so I'm trying to rule out >> some methodology error here. > > Hmm, wow. That's pretty steep. Maybe lazy_heap_reaptid() is hotter > than I think it is, but even if it accounts for 10% of total CPU usage > within a vacuum, which seems like an awful lot, you'd have to make it > 4x as expensive, which also seems like an awful lot. IIRC perf top reported a combined 45% between layz_heap_reaptid + vac_cmp_itemptr (after patching). vac_cmp_itemptr was around 15% on its own Debug build of couse (I need the assertions and the debug symbols), I'll retest with optimizations once debug tests make sense. >>> For example, we could keep a bitmap with one bit per K >>> pages. If the bit is set, there is at least 1 dead tuple on that >>> page; if clear, there are none. When we see an index tuple, we >>> consult the bitmap to determine whether we need to search the TID >>> list. We select K to be the smallest power of 2 such that the bitmap >>> uses less memory than some threshold, perhaps 64kB. >> >> I've been pondering something like that, but that's an optimization >> that's quite orthogonal to the multiarray stuff. > > Sure, but if this really does increase CPU time, it'd be reasonable to > do something to decrease it again in order to get the other benefits > of this patch - i.e. increasing the maintenance_work_mem limit while > reducing the chances that overallocation will cause OOM. I was hoping it wouldn't regress performance so much. I'd rather micro-optimize the multiarray implementation until it doesn't and then think of orthogonal optimizations. >>> Assuming that >>> updates and deletes to the table have some locality, we should be able >>> to skip a large percentage of the TID searches with a probe into this >>> very compact bitmap. >> >> I don't think you can assume locality > > Really? If you have a 1TB table, how many 2MB ranges of that table do > you think will contain dead tuples for a typical vacuum? I think most > tables of that size are going to be mostly static, and the all-visible > and all-frozen bits are going to be mostly set. You *could* have > something like a pgbench-type workload that does scattered updates > across the entire table, but that's going to perform pretty poorly > because you'll constantly be updating blocks that have to be pulled in > from disk. I have a few dozen of those in my biggest database. They do updates and deletes all over the place and, even if they were few, they're scattered almost uniformly. Thing is, I think we really need to not worsen that case, which seems rather common (almost any OLTP with a big enough user base, or a K-V type of table, or TOAST tables).
On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to the hashmap having tuple bitmap.
> The one hashmap entry consists of the block number and the TID bitmap
> of corresponding block, and the block number is the hash key of
> hashmap.
> Current implementation of this patch is not smart yet because each
> hashmap entry allocates the tuple bitmap with fixed
> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> In case where one block can store only the several tens tuples, the
> most bits are would be waste.
>
> After improved this patch as you suggested, I will measure performance benefit.
Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple. In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage. Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk. But even 20x is clearly not
good. There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.
Sawada-san offered to reimplement the patch based on what I proposed upthread. In the new scheme of things, we will allocate a fixed-size bitmap of length 2D bits per page, where D is the average page density of live + dead tuples. (The rationale behind multiplying D by a factor of 2 is to consider the worst-case scenario where every tuple also has an LP_REDIRECT line pointer.) The value of D in most real-world large tables should not go much beyond, say, 100, assuming 80-byte-wide tuples and an 8K block size. That translates to about 25 bytes/page. So all TIDs with offset less than 2D can be represented by a single bit. We augment this with an overflow area to track tuples which fall outside this limit. I believe this area will be small, say 10% of the total allocation.
This representation is at least as good as the current representation if there are at least 4-5% dead tuples. I don't think very large tables will be vacuumed with a scale factor less than that. And assuming 10% dead tuples, this representation will actually be much better.
The idea can fail when (a) there are very few dead tuples in the table, say less than 5%, or (b) there are a large number of tuples falling outside the 2D limit. While I don't expect either of these to hold for real-world, very large tables, (a) can be anticipated when the vacuum starts, in which case we use the current representation, and (b) can be detected at run time, doing a one-time switch between representations. You may argue that managing two representations is clumsy, and I agree, but the code is completely isolated and probably not more than a few hundred lines.
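To make the choice concrete, the sizing arithmetic that could drive the decision looks roughly like this (assumed formulas and made-up names, nothing from a patch):

#include <stdint.h>
#include <stdbool.h>

#define MAX_HEAP_TUPLES_PER_PAGE 291   /* upper bound on offsets we'd ever need */

/* Per-page bitmap size in bytes, capping 2D at the per-page maximum. */
static uint32_t
bitmap_bytes_per_page(double tuples_per_page)
{
    double      bits = 2.0 * tuples_per_page;

    if (bits > MAX_HEAP_TUPLES_PER_PAGE)
        bits = MAX_HEAP_TUPLES_PER_PAGE;
    return (uint32_t) ((bits + 7) / 8);
}

/* Pick the bitmap when its fixed cost beats 6 bytes per expected dead TID. */
static bool
use_bitmap(double tuples_per_page, double expected_dead_fraction)
{
    double      array_bytes_per_page = 6.0 * tuples_per_page * expected_dead_fraction;

    return bitmap_bytes_per_page(tuples_per_page) <= array_bytes_per_page;
}

With D = 100 the bitmap costs 25 bytes/page, so it wins at 5% or 10% dead tuples and loses at 2%, matching the numbers discussed above.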
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
Sawada-san offered to reimplement the patch based on what I proposed upthread. In the new scheme of things, we will allocate a fixed-size bitmap of length 2D bits per page, where D is the average page density of live + dead tuples. (The rationale behind multiplying D by a factor of 2 is to consider the worst-case scenario where every tuple also has an LP_REDIRECT line pointer.) The value of D in most real-world large tables should not go much beyond, say, 100, assuming 80-byte-wide tuples and an 8K block size. That translates to about 25 bytes/page. So all TIDs with offset less than 2D can be represented by a single bit. We augment this with an overflow area to track tuples which fall outside this limit. I believe this area will be small, say 10% of the total allocation.
So I cooked up the attached patch to track the number of live/dead tuples found at each offset from 1 to MaxOffsetNumber. The idea was to see how many tuples actually go beyond the threshold of 2D offsets per page. Note that I am proposing to track the first 2D offsets via the bitmap and the rest via a regular TID array.
So I ran a pgbench test for 2 hrs with scale factor 500. The autovacuum scale factor was set to 0.1, i.e. 10%.
Some interesting bits:
postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup, pg_relation_size(relid)/8192 as relsize, (n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from pg_stat_user_tables ;
relname | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density
------------------+-----------+-----------+---------------+------------+------------+---------+---------
pgbench_tellers | 5000 | 95860289 | 87701578 | 5000 | 0 | 3493 | 1
pgbench_branches | 500 | 95860289 | 94158081 | 967 | 0 | 1544 | 0
pgbench_accounts | 50000000 | 95860289 | 93062567 | 51911657 | 3617465 | 865635 | 64
pgbench_history | 95860289 | 0 | 0 | 95258548 | 0 | 610598 | 156
(4 rows)
The smaller tellers and branches tables bloat so much that the density, as computed from live + dead tuples, falls close to 1 tuple/page. So for such tables, the idea of 2D bits/page will fail miserably. But I think these tables are worst-case representatives, and I would be extremely surprised if we ever found a very large table bloated that much. Even then, this probably tells us that we can't rely solely on the density measure.
Another interesting bit about these small tables is that the largest offset used never went beyond 291, which is the value of MaxHeapTuplesPerPage. I don't know if there is something that prevents inserting more than MaxHeapTuplesPerPage offsets per heap page, and I don't know at this point whether this gives us an upper limit for bits per page (maybe it does).
--
For the pgbench_accounts table, the maximum offset used was 121, though the bulk of the used offsets were at the start of the page (see attached graph). The test did not create enough dead tuples to trigger autovacuum on pgbench_accounts, so I ran a manual vacuum at the end. (There were about 5% dead tuples in the table by the time the test finished.)
postgres=# VACUUM VERBOSE pgbench_accounts ;
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL: CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO: "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL: CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO: index "pgbench_accounts_pkey" now contains 50000000 row versions in 137099 pages
DETAIL: 2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pgbench_accounts": found 852487 removable, 50000000 nonremovable row versions in 865635 out of 865635 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.
(attached graph: tuple count at each offset (offnum:all_tuples:dead_tuples))
For 2797722 dead line pointers, the current representation would have used 2797722 x 6 = 16786332 bytes of memory. The most optimal bitmap would have used 121 bits/page x 865399 pages = 13089159 bytes, whereas if we had provisioned 2D bits/page, assuming D = 64 based on the above calculation, we would have used 13846384 bytes of memory. This is about 18% less than the current representation. Of course, we would have allocated some space for the overflow region, which will make the difference smaller/negligible. But the bitmaps would be extremely cheap to look up during index scans.
Now maybe I got lucky, maybe I did not run the tests long enough (though I believe that may have worked in favour of the bitmap), maybe mostly-HOT-updated tables are not good candidates for testing, and maybe there are situations where the proposed bitmap representation will fail badly. But these tests show that the idea is at least worth considering and that we can improve things for at least some workloads. The question is whether we can avoid regressions in the not-so-good cases.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > Another interesting bit about these small tables is that the largest used > offset for these tables never went beyond 291 which is the value of > MaxHeapTuplesPerPage. I don't know if there is something that prevents > inserting more than MaxHeapTuplesPerPage offsets per heap page and I don't > know at this point if this gives us upper limit for bits per page (may be it > does). From PageAddItemExtended: /* Reject placing items beyond heap boundary, if heap */ if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage) { elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page"); return InvalidOffsetNumber; } Also see the comment where MaxHeapTuplesPerPage is defined: * Note: with HOT, there could theoretically be more line pointers (not actual tuples) than this on a heap page. However we constrain the number of line pointers to this anyway, to avoid excessive line-pointer bloat and not require increases in the size of work arrays. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Another interesting bit about these small tables is that the largest used
> offset for these tables never went beyond 291 which is the value of
> MaxHeapTuplesPerPage. I don't know if there is something that prevents
> inserting more than MaxHeapTuplesPerPage offsets per heap page and I don't
> know at this point if this gives us upper limit for bits per page (may be it
> does).
From PageAddItemExtended:
/* Reject placing items beyond heap boundary, if heap */
if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
{
elog(WARNING, "can't put more than MaxHeapTuplesPerPage items
in a heap page");
return InvalidOffsetNumber;
}
Also see the comment where MaxHeapTuplesPerPage is defined:
* Note: with HOT, there could theoretically be more line pointers (not actual
* tuples) than this on a heap page. However we constrain the number of line
* pointers to this anyway, to avoid excessive line-pointer bloat and not
* require increases in the size of work arrays.
Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per-page bitmap size. That's about 36 bytes for an 8K page. IOW, if on average there are 6 or more dead tuples per page, the bitmap will outperform the current representation, assuming max allocation for the bitmap. If we can use additional estimates to restrict the size to a somewhat more conservative value and then keep an overflow area, then the break-even probably happens even earlier than that. I hope this gives us a good starting point, but let me know if you think it's still the wrong approach to pursue.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page > bitmap size. Thats about 36 bytes for 8K page. IOW if on an average there > are 6 or more dead tuples per page, bitmap will outperform the current > representation, assuming max allocation for bitmap. If we can use additional > estimates to restrict the size to somewhat more conservative value and then > keep overflow area, then probably the break-even happens even earlier than > that. I hope this gives us a good starting point, but let me know if you > think it's still a wrong approach to pursue. Well, it's certainly a bigger change. I think the big concern is that the amount of memory now becomes fixed based on the table size. So one problem is that you have to figure out what you're going to do if the bitmap doesn't fit in maintenance_work_mem. A related problem is that it might fit but use more memory than before, which could cause problems for some people. Now on the other hand it could also use less memory for some people, and that would be good. I am kind of doubtful about this whole line of investigation because we're basically trying pretty hard to fix something that I'm not sure is broken. I do agree that, all other things being equal, the TID lookups will probably be faster with a bitmap than with a binary search, but maybe not if the table is large and the number of dead TIDs is small, because cache efficiency is pretty important. But even if it's always faster, does TID lookup speed even really matter to overall VACUUM performance? Claudio's early results suggest that it might, but maybe that's just a question of some optimization that hasn't been done yet. I'm fairly sure that our number one priority should be to minimize the number of cases where we need to do multiple scans of the indexes to stay within maintenance_work_mem. If we're satisfied we've met that goal, then within that we should try to make VACUUM as fast as possible with as little memory usage as possible. I'm not 100% sure I know how to get there, or how much work it's worth expending. In theory we could even start with the list of TIDs and switch to the bitmap if the TID list becomes larger than the bitmap would have been, but I don't know if it's worth the effort. /me thinks a bit. Actually, I think that probably *is* worthwhile, specifically because it might let us avoid multiple index scans in cases where we currently require them. Right now, our default maintenance_work_mem value is 64MB, which is enough to hold a little over ten million tuples. It's also large enough to hold a bitmap for a 14GB table. So right now if you deleted, say, 100 tuples per page you would end up with an index vacuum cycles for every ~100,000 pages = 800MB, whereas switching to the bitmap representation for such cases would require only one index vacuum cycle for every 14GB, more than an order of magnitude improvement! On the other hand, if we switch to the bitmap as the ONLY possible representation, we will lose badly when there are scattered updates - e.g. 1 deleted tuple every 10 pages. So it seems like we probably want to have both options. One tricky part is figuring out how we switch between them when memory gets tight; we have to avoid bursting above our memory limit while making the switch. And even if our memory limit is very high, we want to avoid using memory gratuitously; I think we should try to grow memory usage incrementally with either representation. 
For instance, one idea to grow memory usage incrementally would be to store dead tuple information separately for each 1GB segment of the relation. So we have an array of dead-tuple-representation objects, one for every 1GB of the relation. If there are no dead tuples in a given 1GB segment, then this pointer can just be NULL. Otherwise, it can point to either the bitmap representation (which will take ~4.5MB) or it can point to an array of TIDs (which will take 6 bytes/TID). That could handle an awfully wide variety of usage patterns efficiently; it's basically never worse than what we're doing today, and when the dead tuple density is high for any portion of the relation it's a lot better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
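Structurally, that per-1GB hybrid could be sketched like this (made-up names; the ~4.5MB and 6-bytes-per-TID figures come from the message above, though the stand-in struct here pads to 8 bytes):

#include <stdint.h>
#include <stddef.h>

#define PAGES_PER_GB    131072      /* 8kB pages per 1GB heap segment */
#define BITS_PER_PAGE   291         /* MaxHeapTuplesPerPage */

typedef struct DeadTid
{
    uint32_t    block;
    uint16_t    offset;
} DeadTid;

typedef enum SegKind { SEG_EMPTY, SEG_BITMAP, SEG_ARRAY } SegKind;

/*
 * Per-1GB-of-heap dead tuple storage: empty when the segment is clean, a
 * page-granular bitmap when dead tuples are dense, a plain TID array when
 * they are sparse.
 */
typedef struct DeadSeg
{
    SegKind     kind;
    union
    {
        uint8_t    *bitmap;         /* PAGES_PER_GB * BITS_PER_PAGE bits */
        struct
        {
            DeadTid    *tids;       /* sorted */
            size_t      ntids;
        }           array;
    }           u;
} DeadSeg;

/* The whole relation: one slot per 1GB segment. */
typedef struct DeadTupleStore
{
    DeadSeg    *segs;
    uint32_t    nsegs;              /* relation size in GB, rounded up */
} DeadTupleStore;

/* Memory a segment currently consumes. */
static size_t
dead_seg_bytes(const DeadSeg *seg)
{
    switch (seg->kind)
    {
        case SEG_BITMAP:
            return (size_t) PAGES_PER_GB * BITS_PER_PAGE / 8;   /* ~4.5MB */
        case SEG_ARRAY:
            return seg->u.array.ntids * sizeof(DeadTid);        /* 8 bytes/TID here,
                                                                   6 for ItemPointerData */
        default:
            return 0;
    }
}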
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet. FYI, the reported impact was on CPU time, not runtime. There was no significant difference in runtime (real time), because my test is heavily I/O bound. I tested with a few small tables and there was no significant difference either, but small tables don't stress the array lookup anyway so that's expected. But on the assumption that some systems may be CPU bound during vacuum (particularly those able to do more than 300-400MB/s sequential I/O), in those cases the increased or decreased cost of lazy_tid_reaped will directly correlate to runtime. It's just none of my systems, which all run on amazon and is heavily bandwidth constrained (fastest I/O subsystem I can get my hands on does 200MB/s).
<p dir="ltr"><p dir="ltr">On Sep 14, 2016 5:18 PM, "Robert Haas" <<a href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> ><br /> > On Wed, Sep 14, 2016 at 8:16AM, Pavan Deolasee<br /> > <<a href="mailto:pavan.deolasee@gmail.com">pavan.deolasee@gmail.com</a>> wrote:<br/> > > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page<br /> > > bitmapsize. Thats about 36 bytes for 8K page. IOW if on an average there<br /> > > are 6 or more dead tuples per page,bitmap will outperform the current<br /> > > representation, assuming max allocation for bitmap. If we can useadditional<br /> > > estimates to restrict the size to somewhat more conservative value and then<br /> > >keep overflow area, then probably the break-even happens even earlier than<br /> > > that. I hope this gives usa good starting point, but let me know if you<br /> > > think it's still a wrong approach to pursue.<br /> ><br/> > Well, it's certainly a bigger change. I think the big concern is that<br /> > the amount of memory nowbecomes fixed based on the table size. So<br /> > one problem is that you have to figure out what you're going todo if<br /> > the bitmap doesn't fit in maintenance_work_mem. A related problem is<br /> > that it might fit butuse more memory than before, which could cause<br /> > problems for some people. Now on the other hand it could alsouse<br /> > less memory for some people, and that would be good.<br /> ><br /> > I am kind of doubtful aboutthis whole line of investigation because<br /> > we're basically trying pretty hard to fix something that I'm notsure<br /> > is broken. I do agree that, all other things being equal, the TID<br /> > lookups will probablybe faster with a bitmap than with a binary<br /> > search, but maybe not if the table is large and the numberof dead<br /> > TIDs is small, because cache efficiency is pretty important. But even<br /> > if it's alwaysfaster, does TID lookup speed even really matter to<br /> > overall VACUUM performance? Claudio's early resultssuggest that it<br /> > might, but maybe that's just a question of some optimization that<br /> > hasn't beendone yet.<br /> ><br /> > I'm fairly sure that our number one priority should be to minimize the<br /> > numberof cases where we need to do multiple scans of the indexes to<br /> > stay within maintenance_work_mem. If we'resatisfied we've met that<br /> > goal, then within that we should try to make VACUUM as fast as<br /> > possiblewith as little memory usage as possible. I'm not 100% sure I<br /> > know how to get there, or how much workit's worth expending. In<br /> > theory we could even start with the list of TIDs and switch to the<br /> > bitmapif the TID list becomes larger than the bitmap would have been,<br /> > but I don't know if it's worth the effort.<br/> ><br /> > /me thinks a bit.<br /> ><br /> > Actually, I think that probably *is* worthwhile, specificallybecause<br /> > it might let us avoid multiple index scans in cases where we currently<br /> > requirethem. Right now, our default maintenance_work_mem value is<br /> > 64MB, which is enough to hold a little overten million tuples. It's<br /> > also large enough to hold a bitmap for a 14GB table. 
So right now if<br /> >you deleted, say, 100 tuples per page you would end up with an index<br /> > vacuum cycles for every ~100,000 pages= 800MB, whereas switching to<br /> > the bitmap representation for such cases would require only one index<br />> vacuum cycle for every 14GB, more than an order of magnitude<br /> > improvement!<br /> ><br /> > On theother hand, if we switch to the bitmap as the ONLY possible<br /> > representation, we will lose badly when there arescattered updates -<br /> > e.g. 1 deleted tuple every 10 pages. So it seems like we probably<br /> > want to haveboth options. One tricky part is figuring out how we<br /> > switch between them when memory gets tight; we haveto avoid bursting<br /> > above our memory limit while making the switch. And even if our<br /> > memory limitis very high, we want to avoid using memory gratuitously;<br /> > I think we should try to grow memory usage incrementallywith either<br /> > representation.<br /> ><br /> > For instance, one idea to grow memory usage incrementallywould be to<br /> > store dead tuple information separately for each 1GB segment of the<br /> > relation. So we have an array of dead-tuple-representation objects,<br /> > one for every 1GB of the relation. If thereare no dead tuples in a<br /> > given 1GB segment, then this pointer can just be NULL. Otherwise, it<br /> >can point to either the bitmap representation (which will take ~4.5MB)<br /> > or it can point to an array of TIDs(which will take 6 bytes/TID).<br /> > That could handle an awfully wide variety of usage patterns<br /> > efficiently;it's basically never worse than what we're doing today,<br /> > and when the dead tuple density is high forany portion of the<br /> > relation it's a lot better.<br /> ><br /> > --<br /> > Robert Haas<br /> > EnterpriseDB:<a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br /> > The Enterprise PostgreSQL Company<br/> ><br /> ><br /> > --<br /> > Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> > To make changes to your subscription:<br/> > <a href="http://www.postgresql.org/mailpref/pgsql-hackers">http://www.postgresql.org/mailpref/pgsql-hackers</a><p dir="ltr">I'dsay it's an idea worth pursuing. It's the base idea behind roaring bitmaps, arguably the best overall compressedbitmap implementation. <br />
On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I am kind of doubtful about this whole line of investigation because
we're basically trying pretty hard to fix something that I'm not sure
is broken. I do agree that, all other things being equal, the TID
lookups will probably be faster with a bitmap than with a binary
search, but maybe not if the table is large and the number of dead
TIDs is small, because cache efficiency is pretty important. But even
if it's always faster, does TID lookup speed even really matter to
overall VACUUM performance? Claudio's early results suggest that it
might, but maybe that's just a question of some optimization that
hasn't been done yet.
Yeah, I wouldn't worry only about the lookup speedup, but if it does speed things up, that's a bonus. The bitmaps seem to win even on memory consumption. As theory and experiments both show, at a 10% dead tuple ratio, bitmaps win handsomely.
In
theory we could even start with the list of TIDs and switch to the
bitmap if the TID list becomes larger than the bitmap would have been,
but I don't know if it's worth the effort.
Yes, that works too. Or maybe it's even better, because we already know the bitmap size requirements, at least for the tuples collected so far. We might need to maintain some more stats to further optimise the representation, but that seems like unnecessary detail at this point.
On the other hand, if we switch to the bitmap as the ONLY possible
representation, we will lose badly when there are scattered updates -
e.g. 1 deleted tuple every 10 pages.
Sure, I never suggested that. What I'd suggested was to switch back to the array representation once we realise bitmaps are not going to work. But I see it's probably better the other way round.
So it seems like we probably
want to have both options. One tricky part is figuring out how we
switch between them when memory gets tight; we have to avoid bursting
above our memory limit while making the switch.
Yes, I was thinking about this problem. Some modelling will be necessary to ensure that we don't go (much) beyond maintenance_work_mem while switching representations, which probably means you need to make the switch earlier than strictly necessary.
For instance, one idea to grow memory usage incrementally would be to
store dead tuple information separately for each 1GB segment of the
relation. So we have an array of dead-tuple-representation objects,
one for every 1GB of the relation. If there are no dead tuples in a
given 1GB segment, then this pointer can just be NULL. Otherwise, it
can point to either the bitmap representation (which will take ~4.5MB)
or it can point to an array of TIDs (which will take 6 bytes/TID).
That could handle an awfully wide variety of usage patterns
efficiently; it's basically never worse than what we're doing today,
and when the dead tuple density is high for any portion of the
relation it's a lot better.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For instance, one idea to grow memory usage incrementally would be to > store dead tuple information separately for each 1GB segment of the > relation. So we have an array of dead-tuple-representation objects, > one for every 1GB of the relation. If there are no dead tuples in a > given 1GB segment, then this pointer can just be NULL. Otherwise, it > can point to either the bitmap representation (which will take ~4.5MB) > or it can point to an array of TIDs (which will take 6 bytes/TID). > That could handle an awfully wide variety of usage patterns > efficiently; it's basically never worse than what we're doing today, > and when the dead tuple density is high for any portion of the > relation it's a lot better. If you compress the list into a bitmap a posteriori, you know the number of tuples per page, so you could encode the bitmap even more efficiently. It's not a bad idea, one that can be slapped on top of the multiarray patch - when closing a segment, it can be decided whether to turn it into a bitmap or not.
Robert Haas wrote: > Actually, I think that probably *is* worthwhile, specifically because > it might let us avoid multiple index scans in cases where we currently > require them. Right now, our default maintenance_work_mem value is > 64MB, which is enough to hold a little over ten million tuples. It's > also large enough to hold a bitmap for a 14GB table. So right now if > you deleted, say, 100 tuples per page you would end up with an index > vacuum cycles for every ~100,000 pages = 800MB, whereas switching to > the bitmap representation for such cases would require only one index > vacuum cycle for every 14GB, more than an order of magnitude > improvement! Yeah, this sounds worthwhile. If we switch to the more compact in-memory representation close to the point where we figure the TID array is not going to fit in m_w_m, then we're saving some number of additional index scans, and I'm pretty sure that the time to transform from array to bitmap is going to be more than paid back by the I/O savings. One thing not quite clear to me is how do we create the bitmap representation starting from the array representation in midflight without using twice as much memory transiently. Are we going to write the array to a temp file, free the array memory, then fill the bitmap by reading the array from disk? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 14 September 2016 at 11:19, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: >> In >> theory we could even start with the list of TIDs and switch to the >> bitmap if the TID list becomes larger than the bitmap would have been, >> but I don't know if it's worth the effort. >> > > Yes, that works too. Or may be even better because we already know the > bitmap size requirements, definitely for the tuples collected so far. We > might need to maintain some more stats to further optimise the > representation, but that seems like unnecessary detailing at this point. That sounds best to me... build the simple representation, but as we do maintain stats to show to what extent that set of tuples is compressible. When we hit the limit on memory we can then selectively compress chunks to stay within memory, starting with the most compressible chunks. I think we should use the chunking approach Robert suggests, though mainly because that allows us to consider how parallel VACUUM should work - writing the chunks to shmem. That would also allow us to apply a single global limit for vacuum memory rather than an allocation per VACUUM. We can then scan multiple indexes at once in parallel, all accessing the shmem data structure. We should also find the compression is better when we consider chunks rather than the whole data structure at once. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
One thing not quite clear to me is how do we create the bitmap
representation starting from the array representation in midflight
without using twice as much memory transiently. Are we going to write
the array to a temp file, free the array memory, then fill the bitmap by
reading the array from disk?
We could do that. Or maybe compress the TID array once half of m_w_m is consumed, and do this repeatedly with the remaining memory. For example, if we start with 1GB of memory, we decide to compress at 512MB. Say that results in a 300MB bitmap. We then continue to accumulate TIDs and do another round of fold-up when another 350MB is consumed.
I think we should maintain a per-offset count of dead tuples so we can choose the optimal bitmap size (with an overflow region where needed). We can also track how many blocks or block ranges have at least one dead tuple, to know whether some sort of indirection is worthwhile. Together that can tell us how much compression can be achieved and allow us to choose the optimal representation.
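As a back-of-the-envelope sketch, the trigger for that repeated fold-up could be as simple as the following (purely illustrative; the struct and threshold are assumptions, not code from the patches discussed here):

#include "postgres.h"

/* Illustrative bookkeeping only; nothing like this exists in the patches. */
typedef struct FoldUpState
{
    Size        budget;         /* total maintenance_work_mem, in bytes */
    Size        bitmap_bytes;   /* space already folded into bitmaps */
    Size        tid_bytes;      /* space used by the uncompressed TID array */
} FoldUpState;

/*
 * Fold the TID array into a bitmap once it has eaten half of whatever budget
 * the earlier bitmaps left over.  With budget = 1GB that means compressing at
 * 512MB of TIDs; if that yields a 300MB bitmap, the next fold-up happens
 * after roughly another 350MB of TIDs, and so on.
 */
static bool
time_to_fold_up(const FoldUpState *st)
{
    return st->tid_bytes >= (st->budget - st->bitmap_bytes) / 2;
}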
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee <pavan.deolasee@gmail.com> writes: > On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: >> One thing not quite clear to me is how do we create the bitmap >> representation starting from the array representation in midflight >> without using twice as much memory transiently. Are we going to write >> the array to a temp file, free the array memory, then fill the bitmap by >> reading the array from disk? > We could do that. People who are vacuuming because they are out of disk space will be very very unhappy with that solution. regards, tom lane
On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> Actually, I think that probably *is* worthwhile, specifically because >> it might let us avoid multiple index scans in cases where we currently >> require them. Right now, our default maintenance_work_mem value is >> 64MB, which is enough to hold a little over ten million tuples. It's >> also large enough to hold a bitmap for a 14GB table. So right now if >> you deleted, say, 100 tuples per page you would end up with an index >> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to >> the bitmap representation for such cases would require only one index >> vacuum cycle for every 14GB, more than an order of magnitude >> improvement! > > Yeah, this sounds worthwhile. If we switch to the more compact > in-memory representation close to the point where we figure the TID > array is not going to fit in m_w_m, then we're saving some number of > additional index scans, and I'm pretty sure that the time to transform > from array to bitmap is going to be more than paid back by the I/O > savings. Yes, that seems pretty clear. The indexes can be arbitrarily large and there can be arbitrarily many of them, so we could be save a LOT of I/O. > One thing not quite clear to me is how do we create the bitmap > representation starting from the array representation in midflight > without using twice as much memory transiently. Are we going to write > the array to a temp file, free the array memory, then fill the bitmap by > reading the array from disk? I was just thinking about this exact problem while I was out to lunch.[1] I wonder if we could do something like this: 1. Allocate an array large enough for one pointer per gigabyte of the underlying relation. 2. Allocate 64MB, or the remaining amount of maintenance_work_mem if it's less, to store TIDs. 3. At the beginning of each 1GB chunk, add a pointer to the first free byte in the slab allocated in step 2 to the array allocated in step 1. Write a header word that identifies this as a TID list (rather than a bitmap) and leave space for a TID count; then, write the TIDs afterwards. Continue doing this until one of the following things happens: (a) we reach the end of the 1GB chunk - if that happens, restart step 3 for the next chunk; (b) we fill the chunk - see step 4, or (c) we write more TIDs for the chunk than the space being used for TIDs now exceeds the space needed for a bitmap - see step 5. 4. When we fill up one of the slabs allocated in step 2, allocate a new one and move the tuples for the current 1GB chunk to the beginning of the new slab using memmove(). This is wasteful of both CPU time and memory, but I think it's not that bad. The maximum possible waste is less than 10%, and many allocators have more overhead than that. We could reduce the waste by using, say, 256MB chunks rather than 1GB chunks. If no new slab can be allocated because maintenance_work_mem is completely exhausted (or the remaining space isn't enough for the TIDs that would need to be moved immediately), then stop and do an index vacuum cycle. 5. When we write a large enough number of TIDs for 1GB chunk that the bitmap would be smaller, check whether sufficient maintenance_work_mem remains to allocate a bitmap for 1GB chunk (~4.5MB). If not, never mind; continue with step 3 as if the bitmap representation did not exist. 
If so, allocate space for a bitmap, move all of the TIDs for the current chunk into it, and update the array allocated in step 1 to point to it. Then, finish scanning the current 1GB chunk, updating that bitmap rather than inserting TIDs into the slab. Rewind our pointer into the slab to where it was at the beginning of the current 1GB chunk, so that the memory we consumed for TIDs can be reused now that those TIDs have been transferred to a bitmap. If, earlier in the current 1GB chunk, we did a memmove-to-next-slab operation as described in step 4, this "rewind" might move our pointer back into the previous slab, in which case we can free the now-empty slab. (The next 1GB segment might have few enough TIDs that they will fit into the leftover space in the previous slab.) With this algorithm, we never exceed maintenance_work_mem, not even transiently. When memory is no longer sufficient to convert to the bitmap representation without bursting above maintenance_work_mem, we simply don't perform the conversion. Also, we do very little memory copying. An alternative I considered was to do a separate allocation for each 1GB chunk rather than carving the dead-tuple space out of slabs. But the problem with that is that you'll have to start those out small (in case you don't find many dead tuples) and then grow them, which means reallocating, which is bad both because it can burst above maintenance_work_mem while the repalloc is in process and also because you have to keep copying the data from the old chunk to the new, bigger chunk. This algorithm only needs to copy TIDs when it runs off the end of a chunk, and that can't happen more than once every dozen or so chunks; in contrast, progressively growing the TID arrays for a given 1GB chunk would potentially memcpy() multiple times per 1GB chunk, and if you used power-of-two reallocation as we normally do the waste would be much more than what step 4 of this algorithm leaves on the table. There are, nevertheless, corner cases where this can lose: if you had a number of TIDs that were going to just BARELY fit within maintenance_work_mem, and if the bitmap representation never wins for you, the additional array allocated in step 1 and the end-of-slab wastage in step 4 could push you over the line. We might be able to tweak things here or there to reduce the potential for that, but it's pretty well unavoidable; the flat array we're using right now has exactly zero allocator overhead, and anything more complex will have some. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Or am I always out to lunch?
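To make the arithmetic behind step 5 concrete, here is a small illustrative helper (the constants follow the message above; the function itself is hypothetical, not from any posted patch):

#include "postgres.h"
#include "access/htup_details.h"   /* MaxHeapTuplesPerPage */
#include "storage/itemptr.h"       /* ItemPointerData */

/* Pages in one 1GB chunk of the heap, and the bitmap that covers it. */
#define CHUNK_PAGES      ((1024 * 1024 * 1024) / BLCKSZ)
#define CHUNK_BITMAP_SZ  ((((uint64) CHUNK_PAGES) * MaxHeapTuplesPerPage + 7) / 8)

/*
 * Step 5's trigger in isolation: switch the current chunk to a bitmap once
 * its TIDs occupy more space than the bitmap would (~4.5MB, or roughly
 * 795k TIDs at 6 bytes each with 8KB blocks), but only if enough of the
 * maintenance_work_mem budget remains to allocate that bitmap without
 * bursting over the limit.
 */
static bool
should_switch_to_bitmap(uint64 chunk_ntids, Size budget_left)
{
    uint64      tid_bytes = chunk_ntids * sizeof(ItemPointerData);

    return tid_bytes > CHUNK_BITMAP_SZ && budget_left >= CHUNK_BITMAP_SZ;
}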
On Thu, Sep 15, 2016 at 2:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 14 September 2016 at 11:19, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > >>> In >>> theory we could even start with the list of TIDs and switch to the >>> bitmap if the TID list becomes larger than the bitmap would have been, >>> but I don't know if it's worth the effort. >>> >> >> Yes, that works too. Or may be even better because we already know the >> bitmap size requirements, definitely for the tuples collected so far. We >> might need to maintain some more stats to further optimise the >> representation, but that seems like unnecessary detailing at this point. > > That sounds best to me... build the simple representation, but as we > do maintain stats to show to what extent that set of tuples is > compressible. > > When we hit the limit on memory we can then selectively compress > chunks to stay within memory, starting with the most compressible > chunks. > > I think we should use the chunking approach Robert suggests, though > mainly because that allows us to consider how parallel VACUUM should > work - writing the chunks to shmem. That would also allow us to apply > a single global limit for vacuum memory rather than an allocation per > VACUUM. > We can then scan multiple indexes at once in parallel, all accessing > the shmem data structure. > Yeah, the chunking approach Robert suggested seems like a good idea but considering implementing parallel vacuum, it would be more complicated IMO. I think It's better the multiple process simply allocate memory space for its process than that the single process allocate huge memory space using complicated way. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 09/14/2016 07:57 PM, Tom Lane wrote: > Pavan Deolasee <pavan.deolasee@gmail.com> writes: >> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com> >> wrote: >>> One thing not quite clear to me is how do we create the bitmap >>> representation starting from the array representation in midflight >>> without using twice as much memory transiently. Are we going to write >>> the array to a temp file, free the array memory, then fill the bitmap by >>> reading the array from disk? > >> We could do that. > > People who are vacuuming because they are out of disk space will be very > very unhappy with that solution. The people are usually running out of space for data, while these files would be temporary files placed wherever temp_tablespaces points to. I'd argue if this is a source of problems, the people are already in deep trouble due to sorts, CREATE INDEX, ... as those commands may also generate a lot of temporary files. regards Tomas
On 09/14/2016 05:17 PM, Robert Haas wrote: > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet. Regarding the lookup performance, I don't think the bitmap alone can significantly improve that - it's more efficient memory-wise, no doubt about that, but it's still likely larger than CPU caches and accessed mostly randomly (when vacuuming the indexes). IMHO the best way to speed-up lookups (if it's really an issue, haven't done any benchmarks) would be to build a small bloom filter in front of the TID array / bitmap. It shall be fairly small (depending on the number of TIDs, error rate etc.) and likely to fit into L2/L3, and eliminate a lot of probes into the much larger array/bitmap. Of course, it's another layer of complexity - the good thing is we don't need to build the filter until after we collect the TIDs, so we got pretty good inputs for the bloom filter parameters. But all this is based on the assumption that the lookups are actually expensive, not sure about that. regards Tomas
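For a sense of scale, the standard Bloom-filter sizing math (textbook formulas, not code from any patch in this thread) works out as follows:

#include "postgres.h"
#include <math.h>

/*
 * Classic Bloom filter sizing: for n entries and target false-positive rate
 * p, the filter needs m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2
 * hash functions.  For example, one million dead TIDs at p = 1% need about
 * 1.2MB and k = 7, versus ~6MB for the corresponding plain TID array, so a
 * filter in front of the array/bitmap can stay comparatively small.
 */
static void
bloom_sizing(uint64 n, double p, uint64 *nbits, int *nhashes)
{
    double      ln2 = log(2.0);
    double      m = -((double) n) * log(p) / (ln2 * ln2);

    *nbits = (uint64) ceil(m);
    *nhashes = (int) round((m / (double) n) * ln2);
}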
On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 09/14/2016 07:57 PM, Tom Lane wrote: >> >> Pavan Deolasee <pavan.deolasee@gmail.com> writes: >>> >>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> >>> wrote: >>>> >>>> One thing not quite clear to me is how do we create the bitmap >>>> representation starting from the array representation in midflight >>>> without using twice as much memory transiently. Are we going to write >>>> the array to a temp file, free the array memory, then fill the bitmap by >>>> reading the array from disk? >> >> >>> We could do that. >> >> >> People who are vacuuming because they are out of disk space will be very >> very unhappy with that solution. > > > The people are usually running out of space for data, while these files > would be temporary files placed wherever temp_tablespaces points to. I'd > argue if this is a source of problems, the people are already in deep > trouble due to sorts, CREATE INDEX, ... as those commands may also generate > a lot of temporary files. One would not expect "CREATE INDEX" to succeed when space is tight, but VACUUM is quite the opposite. Still, temporary storage could be used if available, and gracefully fall back to some other technique (like not using bitmaps) when not. Not sure it's worth the trouble, though. On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> I am kind of doubtful about this whole line of investigation because >> we're basically trying pretty hard to fix something that I'm not sure >> is broken. I do agree that, all other things being equal, the TID >> lookups will probably be faster with a bitmap than with a binary >> search, but maybe not if the table is large and the number of dead >> TIDs is small, because cache efficiency is pretty important. But even >> if it's always faster, does TID lookup speed even really matter to >> overall VACUUM performance? Claudio's early results suggest that it >> might, but maybe that's just a question of some optimization that >> hasn't been done yet. > > FYI, the reported impact was on CPU time, not runtime. There was no > significant difference in runtime (real time), because my test is > heavily I/O bound. > > I tested with a few small tables and there was no significant > difference either, but small tables don't stress the array lookup > anyway so that's expected. > > But on the assumption that some systems may be CPU bound during vacuum > (particularly those able to do more than 300-400MB/s sequential I/O), > in those cases the increased or decreased cost of lazy_tid_reaped will > directly correlate to runtime. It's just none of my systems, which all > run on amazon and is heavily bandwidth constrained (fastest I/O > subsystem I can get my hands on does 200MB/s). Attached is the patch with the multiarray version. The tests are weird. Best case comparison over several runs, to remove the impact of concurrent activity on this host (I couldn't remove all background activity even when running the tests overnight, the distro adds tons of crons and background cleanup tasks it would seem), there's only very mild CPU impact. I'd say insignificant, as it's well below the mean variance. Worst case: DETAIL: CPU 9.90s/80.94u sec elapsed 1232.42 sec. Best case: DETAIL: CPU 12.10s/63.82u sec elapsed 832.79 sec. 
There seems to be more variance with the multiarray approach than the single array one, but I could not figure out why. Even I/O seems less stable: Worst case: INFO: "pgbench_accounts": removed 400000000 row versions in 6557382 pages DETAIL: CPU 64.31s/37.60u sec elapsed 2573.88 sec. Best case: INFO: "pgbench_accounts": removed 400000000 row versions in 6557378 pages DETAIL: CPU 54.48s/31.78u sec elapsed 1552.18 sec. Since this test takes several hours to complete, I could only run a few runs of each version, so the statistical significance of the test isn't very bright. I'll try to compare with smaller pgbench scale numbers and more runs over the weekend (gotta script that). It's certainly puzzling, I cannot explain the increased variance, especially in I/O, since the I/O should be exactly the same. I'm betting it's my system that's unpredictable somehow. So I'm posting the patch in case someone gets inspired and can spot the reason, and because there's been a lot of talk about this very same approach, so I thought I'd better post the code ;) I'll also try to get a more predictable system to run the tests on.
Attachment
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 09/14/2016 07:57 PM, Tom Lane wrote: >> People who are vacuuming because they are out of disk space will be very >> very unhappy with that solution. > The people are usually running out of space for data, while these files > would be temporary files placed wherever temp_tablespaces points to. I'd > argue if this is a source of problems, the people are already in deep > trouble due to sorts, CREATE INDEX, ... as those commands may also > generate a lot of temporary files. Except that if you are trying to recover disk space, VACUUM is what you are doing, not CREATE INDEX. Requiring extra disk space to perform a vacuum successfully is exactly the wrong direction to be going in. See for example this current commitfest entry: https://commitfest.postgresql.org/10/649/ Regardless of what you think of the merits of that patch, it's trying to solve a real-world problem. And as Robert has already pointed out, making this aspect of VACUUM more complicated is not solving any pressing problem. "But we made it faster" is going to be a poor answer for the next person who finds themselves up against the wall with no recourse. regards, tom lane
On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 09/14/2016 07:57 PM, Tom Lane wrote: >>> People who are vacuuming because they are out of disk space will be very >>> very unhappy with that solution. > >> The people are usually running out of space for data, while these files >> would be temporary files placed wherever temp_tablespaces points to. I'd >> argue if this is a source of problems, the people are already in deep >> trouble due to sorts, CREATE INDEX, ... as those commands may also >> generate a lot of temporary files. > > Except that if you are trying to recover disk space, VACUUM is what you > are doing, not CREATE INDEX. Requiring extra disk space to perform a > vacuum successfully is exactly the wrong direction to be going in. > See for example this current commitfest entry: > https://commitfest.postgresql.org/10/649/ > Regardless of what you think of the merits of that patch, it's trying > to solve a real-world problem. And as Robert has already pointed out, > making this aspect of VACUUM more complicated is not solving any > pressing problem. "But we made it faster" is going to be a poor answer > for the next person who finds themselves up against the wall with no > recourse. I very much agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/15/2016 06:40 PM, Robert Haas wrote: > On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >>> On 09/14/2016 07:57 PM, Tom Lane wrote: >>>> People who are vacuuming because they are out of disk space will be very >>>> very unhappy with that solution. >> >>> The people are usually running out of space for data, while these files >>> would be temporary files placed wherever temp_tablespaces points to. I'd >>> argue if this is a source of problems, the people are already in deep >>> trouble due to sorts, CREATE INDEX, ... as those commands may also >>> generate a lot of temporary files. >> >> Except that if you are trying to recover disk space, VACUUM is what you >> are doing, not CREATE INDEX. Requiring extra disk space to perform a >> vacuum successfully is exactly the wrong direction to be going in. >> See for example this current commitfest entry: >> https://commitfest.postgresql.org/10/649/ >> Regardless of what you think of the merits of that patch, it's trying >> to solve a real-world problem. And as Robert has already pointed out, >> making this aspect of VACUUM more complicated is not solving any >> pressing problem. "But we made it faster" is going to be a poor answer >> for the next person who finds themselves up against the wall with no >> recourse. > > I very much agree. > How does VACUUM alone help with recovering disk space? AFAIK it only makes the space available for new data, it does not reclaim the disk space at all. Sure, we truncate empty pages at the end of the last segment, but how likely is that in practice? What I do see people doing is usually either VACUUM FULL (which is however doomed for obvious reasons) or VACUUM + reindexing to get rid of index bloat (which however leads to CREATE INDEX using temporary files). I'm not sure I agree with your claim there's no pressing problem. We do see quite a few people having to do VACUUM with multiple index scans (because the TIDs don't fit into m_w_m), which certainly has significant impact on production systems - both in terms of performance and it also slows down reclaiming the space. Sure, being able to set m_w_m above 1GB is an improvement, but perhaps using a more efficient TID storage would improve the situation further. Writing the TIDs to a temporary file may not the right approach, but I don't see why that would make the original problem less severe? For example, we always allocate the TID array as large as we can fit into m_w_m, but maybe we don't need to wait with switching to the bitmap until filling the whole array - we could wait as long as the bitmap fits into the remaining part of the array, build it there and then copy it to the beginning (and use the bitmap from that point). regards Tomas
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > For example, we always allocate the TID array as large as we can fit into > m_w_m, but maybe we don't need to wait with switching to the bitmap until > filling the whole array - we could wait as long as the bitmap fits into the > remaining part of the array, build it there and then copy it to the > beginning (and use the bitmap from that point). The bitmap can be created like that, but grow from the end of the segment backwards. So the scan can proceed until the bitmap fills the whole segment (filling backwards), no copy required. I'll try that later, but first I'd like to get multiarray approach right since that's the basis of it.
On Fri, Sep 16, 2016 at 12:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> For example, we always allocate the TID array as large as we can fit into
> m_w_m, but maybe we don't need to wait with switching to the bitmap until
> filling the whole array - we could wait as long as the bitmap fits into the
> remaining part of the array, build it there and then copy it to the
> beginning (and use the bitmap from that point).
The bitmap can be created like that, but grow from the end of the
segment backwards.
So the scan can proceed until the bitmap fills the whole segment
(filling backwards), no copy required.
But I actually wonder if we are over engineering things and overestimating cost of memmove etc. How about this simpler approach:
1. Divide the table into fixed-size chunks, as Robert suggested. Say 1GB each.
2. Allocate a pointer array to store a pointer to each segment. For a 1TB table, that's about 8192 bytes.
3. Allocate a bitmap that can hold MaxHeapTuplesPerPage * (chunk size in pages) bits. For 8192-byte blocks and a 1GB chunk, that's about 4.6MB. Note: I'm suggesting a bitmap here because, provisioning for the worst case, a fixed-size TID array would cost us 200MB+, whereas the bitmap is just 4.6MB.
4. Collect dead tuples in that 1GB chunk. Also collect stats so that we know the optimal representation.
5. At the end of the 1GB scan, if no dead tuples were found, set the chunk pointer to NULL, move to the next chunk and restart step 4. If dead tuples were found, then check:
a. Whether the bitmap can be further compressed by using fewer bits per page. If so, allocate a new bitmap and compress into it.
b. Whether a TID array would be a more compact representation. If so, allocate a TID array of the right size and convert the bitmap into an array.
c. Set the chunk pointer to whichever representation we choose (adding headers etc. so the representation can be interpreted).
6. Continue until we consume all of m_w_m or reach the end of the table. If we consume all of m_w_m, do a round of index cleanup and restart.
I also realised that we can compact the TID array in step (b) above because we only need to store 17 bits for block numbers (we already know which 1GB segment they belong to). Given that usable offsets are also just 13 bits, TID array needs only 4 bytes per TID instead of 6.
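A quick sketch of that packing, just to make the bit budget concrete (illustrative helpers; the field widths simply follow the reasoning above):

#include "postgres.h"
#include "storage/itemptr.h"   /* BlockNumber, OffsetNumber */

/*
 * Within a known 1GB segment (2^17 pages of 8KB), the block number relative
 * to the segment fits in 17 bits and the line-pointer offset in 13 bits, so
 * one uint32 per dead tuple is enough instead of a 6-byte ItemPointerData.
 */
typedef uint32 PackedTid;

static inline PackedTid
pack_tid(BlockNumber blk_in_segment, OffsetNumber off)
{
    return ((PackedTid) blk_in_segment << 13) | (PackedTid) off;
}

static inline void
unpack_tid(PackedTid packed, BlockNumber *blk_in_segment, OffsetNumber *off)
{
    *blk_in_segment = packed >> 13;
    *off = (OffsetNumber) (packed & 0x1FFF);
}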
Many people are working on implementing different ideas, and I can volunteer to write a patch on these lines unless someone wants to do that.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 16, 2016 at 9:09 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I also realised that we can compact the TID array in step (b) above because we only need to store 17 bits for block numbers (we already know which 1GB segment they belong to). Given that usable offsets are also just 13 bits, TID array needs only 4 bytes per TID instead of 6.
Actually this seems like a clear saving of at least 30% for all use cases, at the cost of allocating in smaller chunks and doing some transformations. But given the problem we are trying to solve, that seems like a small price to pay.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > But I actually wonder if we are over engineering things and overestimating > cost of memmove etc. How about this simpler approach: Don't forget that you need to handle the case where maintenance_work_mem is quite small. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> But I actually wonder if we are over engineering things and overestimating
> cost of memmove etc. How about this simpler approach:
Don't forget that you need to handle the case where
maintenance_work_mem is quite small.
How small? The default IIRC these days is 64MB and minimum is 1MB. I think we can do some special casing for very small values and ensure that things at the very least work and hopefully don't regress for them.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 16, 2016 at 9:47 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee >> <pavan.deolasee@gmail.com> wrote: >> > But I actually wonder if we are over engineering things and >> > overestimating >> > cost of memmove etc. How about this simpler approach: >> >> Don't forget that you need to handle the case where >> maintenance_work_mem is quite small. > > How small? The default IIRC these days is 64MB and minimum is 1MB. I think > we can do some special casing for very small values and ensure that things > at the very least work and hopefully don't regress for them. Sounds like you need to handle values as small as 1MB, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> >>> I am kind of doubtful about this whole line of investigation because >>> we're basically trying pretty hard to fix something that I'm not sure >>> is broken. I do agree that, all other things being equal, the TID >>> lookups will probably be faster with a bitmap than with a binary >>> search, but maybe not if the table is large and the number of dead >>> TIDs is small, because cache efficiency is pretty important. But even >>> if it's always faster, does TID lookup speed even really matter to >>> overall VACUUM performance? Claudio's early results suggest that it >>> might, but maybe that's just a question of some optimization that >>> hasn't been done yet. >> >> FYI, the reported impact was on CPU time, not runtime. There was no >> significant difference in runtime (real time), because my test is >> heavily I/O bound. >> >> I tested with a few small tables and there was no significant >> difference either, but small tables don't stress the array lookup >> anyway so that's expected. >> >> But on the assumption that some systems may be CPU bound during vacuum >> (particularly those able to do more than 300-400MB/s sequential I/O), >> in those cases the increased or decreased cost of lazy_tid_reaped will >> directly correlate to runtime. It's just none of my systems, which all >> run on amazon and is heavily bandwidth constrained (fastest I/O >> subsystem I can get my hands on does 200MB/s). > > Attached is the patch with the multiarray version. > > The tests are weird. Best case comparison over several runs, to remove > the impact of concurrent activity on this host (I couldn't remove all > background activity even when running the tests overnight, the distro > adds tons of crons and background cleanup tasks it would seem), > there's only very mild CPU impact. I'd say insignificant, as it's well > below the mean variance. I reran the tests on a really dedicated system, and with a script that captured a lot more details about the runs. The system isn't impressive, an i5 with a single consumer HD and 8GB RAM, but it did the job. These tests make more sense, so I bet it was the previous tests that were spoiled by concurrent activity on the host. Attached is the raw output of the test, the script used to create it, and just in case the patch set used. I believe it's the same as the last one I posted, just rebased. In the results archive, the .vacuum prefix is the patched version with both patch 1 and 2, .git.ref is just patch 1 (without which the truncate takes unholily long). Grepping the results a bit, picking an average run out of all runs on each scale: Timings: Patched: s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s. s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s. s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s. Unpatched: s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s. s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s. s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s. Total I/O (in MB) Patched: s100: R:2.4 - W:5862 s400: R:1337.4 - W:29385.6 s4000: R:318631 - W:370154 Unpatched: s100: R:1412.4 - W:7644.6 s400: R:3180.6 - W:36281.4 s4000: R:330683 - W:370391 So, in essence, CPU time didn't get adversely affected. 
If anything, it got improved by about 20% on the biggest case (scale 4000). While total runtime didn't change much, I believe this is only due to the fact that the index is perfectly correlated (clustered?) since it's a pristine index, so index scans either remove or skip full pages, never leaving things half-way. A bloated index would probably show a substantially different behavior, I'll try to get a script that does it by running pgbench a while before the vacuum or something like that. However, the total I/O metric already shows remarkable improvement. This metric is measuring all the I/O including pgbench init, the initial vacuum pgbench init does, the delete and the final vacuum. So it's not just the I/O for the vacuum itself, but the whole run. We can see the patched version reading a lot less (less scans over the indexes will do that), and in some cases writing less too (again, less index scans may be performing less redundant writes when cleaning up/reclaiming index pages). I'll post when I get the results for the bloated case, but I believe this already shows substantial improvement as is. This approach can later be improved upon by turning tid segments into bitmaps if they're packed densely enough, but I believe this patch represents a sensible first step before attempting that.
Attachment
On Thu, Oct 27, 2016 at 5:25 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> >>>> I am kind of doubtful about this whole line of investigation because >>>> we're basically trying pretty hard to fix something that I'm not sure >>>> is broken. I do agree that, all other things being equal, the TID >>>> lookups will probably be faster with a bitmap than with a binary >>>> search, but maybe not if the table is large and the number of dead >>>> TIDs is small, because cache efficiency is pretty important. But even >>>> if it's always faster, does TID lookup speed even really matter to >>>> overall VACUUM performance? Claudio's early results suggest that it >>>> might, but maybe that's just a question of some optimization that >>>> hasn't been done yet. >>> >>> FYI, the reported impact was on CPU time, not runtime. There was no >>> significant difference in runtime (real time), because my test is >>> heavily I/O bound. >>> >>> I tested with a few small tables and there was no significant >>> difference either, but small tables don't stress the array lookup >>> anyway so that's expected. >>> >>> But on the assumption that some systems may be CPU bound during vacuum >>> (particularly those able to do more than 300-400MB/s sequential I/O), >>> in those cases the increased or decreased cost of lazy_tid_reaped will >>> directly correlate to runtime. It's just none of my systems, which all >>> run on amazon and is heavily bandwidth constrained (fastest I/O >>> subsystem I can get my hands on does 200MB/s). >> >> Attached is the patch with the multiarray version. >> >> The tests are weird. Best case comparison over several runs, to remove >> the impact of concurrent activity on this host (I couldn't remove all >> background activity even when running the tests overnight, the distro >> adds tons of crons and background cleanup tasks it would seem), >> there's only very mild CPU impact. I'd say insignificant, as it's well >> below the mean variance. > > I reran the tests on a really dedicated system, and with a script that > captured a lot more details about the runs. > > The system isn't impressive, an i5 with a single consumer HD and 8GB > RAM, but it did the job. > > These tests make more sense, so I bet it was the previous tests that > were spoiled by concurrent activity on the host. > > Attached is the raw output of the test, the script used to create it, > and just in case the patch set used. I believe it's the same as the > last one I posted, just rebased. I glanced at the patches but the both patches don't obey the coding style of PostgreSQL. Please refer to [1]. [1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F. > > In the results archive, the .vacuum prefix is the patched version with > both patch 1 and 2, .git.ref is just patch 1 (without which the > truncate takes unholily long). Did you measure the performance benefit of 0001 patch by comparing HEAD with HEAD+0001 patch? > Grepping the results a bit, picking an average run out of all runs on > each scale: > > Timings: > > Patched: > > s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s. > s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s. > s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s. 
> > Unpatched: > > s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s. > s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s. > s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s. > > Total I/O (in MB) > > Patched: > > s100: R:2.4 - W:5862 > s400: R:1337.4 - W:29385.6 > s4000: R:318631 - W:370154 > > Unpatched: > > s100: R:1412.4 - W:7644.6 > s400: R:3180.6 - W:36281.4 > s4000: R:330683 - W:370391 > > > So, in essence, CPU time didn't get adversely affected. If anything, > it got improved by about 20% on the biggest case (scale 4000). And this test case deletes all tuples in relation and then measure duration of vacuum. It would not be effect much in practical use case. > While total runtime didn't change much, I believe this is only due to > the fact that the index is perfectly correlated (clustered?) since > it's a pristine index, so index scans either remove or skip full > pages, never leaving things half-way. A bloated index would probably > show a substantially different behavior, I'll try to get a script that > does it by running pgbench a while before the vacuum or something like > that. > > However, the total I/O metric already shows remarkable improvement. > This metric is measuring all the I/O including pgbench init, the > initial vacuum pgbench init does, the delete and the final vacuum. So > it's not just the I/O for the vacuum itself, but the whole run. We can > see the patched version reading a lot less (less scans over the > indexes will do that), and in some cases writing less too (again, less > index scans may be performing less redundant writes when cleaning > up/reclaiming index pages). What value of maintenance_work_mem did you use for this test? Since DeadTuplesSegment struct still stores array of ItemPointerData(6byte) representing dead tuple I supposed that the frequency of index vacuum does not change. But according to the test result, a index vacuum is invoked once and removes 400000000 rows at the time. It means that the vacuum stored about 2289 MB memory during heap vacuum. On the other side, the result of test without 0002 patch show that a index vacuum remove 178956737 rows at the time, which means 1GB memory was used. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I glanced at the patches but the both patches don't obey the coding > style of PostgreSQL. > Please refer to [1]. > > [1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F. I thought I had. I'll go through that list to check what I missed. >> In the results archive, the .vacuum prefix is the patched version with >> both patch 1 and 2, .git.ref is just patch 1 (without which the >> truncate takes unholily long). > > Did you measure the performance benefit of 0001 patch by comparing > HEAD with HEAD+0001 patch? Not the whole test, but yes. Without the 0001 patch the backward scan step during truncate goes between 3 and 5 times slower. I don't have timings because the test never finished without the patch. It would have finished, but it would have taken about a day. >> Grepping the results a bit, picking an average run out of all runs on >> each scale: >> >> Timings: >> >> Patched: >> >> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s. >> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s. >> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s. >> >> Unpatched: >> >> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s. >> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s. >> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s. >> >> Total I/O (in MB) >> >> Patched: >> >> s100: R:2.4 - W:5862 >> s400: R:1337.4 - W:29385.6 >> s4000: R:318631 - W:370154 >> >> Unpatched: >> >> s100: R:1412.4 - W:7644.6 >> s400: R:3180.6 - W:36281.4 >> s4000: R:330683 - W:370391 >> >> >> So, in essence, CPU time didn't get adversely affected. If anything, >> it got improved by about 20% on the biggest case (scale 4000). > > And this test case deletes all tuples in relation and then measure > duration of vacuum. > It would not be effect much in practical use case. Well, this patch set started because I had to do exactly that, and realized just how inefficient vacuum was in that case. But it doesn't mean it won't benefit more realistic use cases. Almost any big database ends up hitting this 1GB limit because big tables take very long to vacuum and accumulate a lot of bloat in-between vacuums. If you have a specific test case in mind I can try to run it. >> While total runtime didn't change much, I believe this is only due to >> the fact that the index is perfectly correlated (clustered?) since >> it's a pristine index, so index scans either remove or skip full >> pages, never leaving things half-way. A bloated index would probably >> show a substantially different behavior, I'll try to get a script that >> does it by running pgbench a while before the vacuum or something like >> that. >> >> However, the total I/O metric already shows remarkable improvement. >> This metric is measuring all the I/O including pgbench init, the >> initial vacuum pgbench init does, the delete and the final vacuum. So >> it's not just the I/O for the vacuum itself, but the whole run. We can >> see the patched version reading a lot less (less scans over the >> indexes will do that), and in some cases writing less too (again, less >> index scans may be performing less redundant writes when cleaning >> up/reclaiming index pages). > > What value of maintenance_work_mem did you use for this test? 4GB on both, patched and HEAD. 
> Since > DeadTuplesSegment struct still stores array of ItemPointerData(6byte) > representing dead tuple I supposed that the frequency of index vacuum > does not change. But according to the test result, a index vacuum is > invoked once and removes 400000000 rows at the time. It means that the > vacuum stored about 2289 MB memory during heap vacuum. On the other > side, the result of test without 0002 patch show that a index vacuum > remove 178956737 rows at the time, which means 1GB memory was used. 1GB is a hardcoded limit on HEAD for vacuum work mem. This patch removes that limit and lets vacuum use all the memory the user gave to vacuum. In the test, in both cases, 4GB was used as maintenance_work_mem value, but HEAD cannot use all the 4GB, and it will limit itself to just 1GB.
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I glanced at the patches but the both patches don't obey the coding >> style of PostgreSQL. >> Please refer to [1]. >> >> [1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F. > > I thought I had. I'll go through that list to check what I missed. Attached is patch 0002 with pgindent applied over it I don't think there's any other formatting issue, but feel free to point a finger to it if I missed any
Attachment
On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Attached is patch 0002 with pgindent applied over it > > I don't think there's any other formatting issue, but feel free to > point a finger to it if I missed any Hmm, I had imagined making all of the segments the same size rather than having the size grow exponentially. The whole point of this is to save memory, and even in the worst case you don't end up with that many segments as long as you pick a reasonable base size (e.g. 64MB). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Attached is patch 0002 with pgindent applied over it >> >> I don't think there's any other formatting issue, but feel free to >> point a finger to it if I missed any > > Hmm, I had imagined making all of the segments the same size rather > than having the size grow exponentially. The whole point of this is > to save memory, and even in the worst case you don't end up with that > many segments as long as you pick a reasonable base size (e.g. 64MB). Wastage is bound by a fraction of the total required RAM, that is, it's proportional to the amount of required RAM, not the amount allocated. So it should still be fine, and the exponential strategy should improve lookup performance considerably.
On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> Attached is patch 0002 with pgindent applied over it >>> >>> I don't think there's any other formatting issue, but feel free to >>> point a finger to it if I missed any >> >> Hmm, I had imagined making all of the segments the same size rather >> than having the size grow exponentially. The whole point of this is >> to save memory, and even in the worst case you don't end up with that >> many segments as long as you pick a reasonable base size (e.g. 64MB). > > Wastage is bound by a fraction of the total required RAM, that is, > it's proportional to the amount of required RAM, not the amount > allocated. So it should still be fine, and the exponential strategy > should improve lookup performance considerably. I'm concerned that it could use repalloc for large memory area when vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead and slow. What about using semi-fixed memroy space without repalloc; Allocate the array of ItemPointerData array, and each ItemPointerData array stores the dead tuple locations. The size of ItemPointerData array starts with small size (e.g. 16MB or 32MB). After we used an array up, we then allocate next segment with twice size as previous segment. But to prevent over allocating memory, it would be better to set a limit of allocating size of ItemPointerData array to 512MB or 1GB. We could expand array of array using repalloc if needed, but it would not be happend much. Other design is similar to current your patch; the offset of the array of array and the offset of a ItemPointerData array control current location, which are cleared after finished reclaiming garbage on heap and index. And allocated array is re-used by subsequent scanning heap. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> Attached is patch 0002 with pgindent applied over it >>>> >>>> I don't think there's any other formatting issue, but feel free to >>>> point a finger to it if I missed any >>> >>> Hmm, I had imagined making all of the segments the same size rather >>> than having the size grow exponentially. The whole point of this is >>> to save memory, and even in the worst case you don't end up with that >>> many segments as long as you pick a reasonable base size (e.g. 64MB). >> >> Wastage is bound by a fraction of the total required RAM, that is, >> it's proportional to the amount of required RAM, not the amount >> allocated. So it should still be fine, and the exponential strategy >> should improve lookup performance considerably. > > I'm concerned that it could use repalloc for large memory area when > vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead > and slow. How large? That array cannot be very large. It contains pointers to exponentially-growing arrays, but the repalloc'd array itself is small: one struct per segment, each segment starts at 128MB and grows exponentially. In fact, IIRC, it can be proven that such a repalloc strategy has an amortized cost of O(log log n) per tuple. If it repallocd the whole tid array it would be O(log n), but since it handles only pointers to segments of exponentially growing tuples it becomes O(log log n). Furthermore, n there is still limited to MAX_INT, which means the cost per tuple is bound by O(log log 2^32) = 5. And that's an absolute worst case that's ignoring the 128MB constant factor which is indeed relevant. > What about using semi-fixed memroy space without repalloc; > Allocate the array of ItemPointerData array, and each ItemPointerData > array stores the dead tuple locations. The size of ItemPointerData > array starts with small size (e.g. 16MB or 32MB). After we used an > array up, we then allocate next segment with twice size as previous > segment. That's what the patch does. > But to prevent over allocating memory, it would be better to > set a limit of allocating size of ItemPointerData array to 512MB or > 1GB. There already is a limit to 1GB (the maximum amount palloc can allocate)
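A toy sketch of that growth strategy, under the assumptions stated above (128MB starting size, doubling; the helper is illustrative, not the patch's actual code):

#include "postgres.h"

/*
 * Next segment size for an exponentially growing multiarray: start at 128MB,
 * double each time, and never ask for more than the remaining
 * maintenance_work_mem budget (the allocator's own per-chunk ceiling applies
 * on top of this).  Even a multi-GB budget therefore needs only a handful of
 * segments, so the repalloc'd array of segment descriptors stays tiny.
 */
static Size
next_segment_bytes(Size prev_bytes, Size budget_left)
{
    Size        want;

    want = (prev_bytes == 0) ? (Size) 128 * 1024 * 1024 : prev_bytes * 2;
    return Min(want, budget_left);
}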
On Tue, Nov 22, 2016 at 4:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> Attached is patch 0002 with pgindent applied over it
>>>>
>>>> I don't think there's any other formatting issue, but feel free to
>>>> point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially. The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc for large memory area when
> vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> and slow.
How large?
That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment, each segment starts at 128MB and grows
exponentially.
In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repallocd the whole
tid array it would be O(log n), but since it handles only pointers to
segments of exponentially growing tuples it becomes O(log log n).
Furthermore, n there is still limited to MAX_INT, which means the cost
per tuple is bound by O(log log 2^32) = 5. And that's an absolute
worst case that's ignoring the 128MB constant factor which is indeed
relevant.
> What about using semi-fixed memroy space without repalloc;
> Allocate the array of ItemPointerData array, and each ItemPointerData
> array stores the dead tuple locations. The size of ItemPointerData
> array starts with small size (e.g. 16MB or 32MB). After we used an
> array up, we then allocate next segment with twice size as previous
> segment.
That's what the patch does.
> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.
There already is a limit to 1GB (the maximum amount palloc can allocate)
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: not tested Spec compliant: not tested Documentation: not tested Hi, I haven't read through the thread yet, just tried to apply the patch and run tests. And it seems that the last attached version is outdated now. It doesn't apply to the master and after I've tried to fix merge conflict, it segfaults at initdb. So, looking forward to a new version for review. The new status of this patch is: Waiting on Author
On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova <lubennikovaav@gmail.com> wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, failed > Implements feature: not tested > Spec compliant: not tested > Documentation: not tested > > Hi, > I haven't read through the thread yet, just tried to apply the patch and run tests. > And it seems that the last attached version is outdated now. It doesn't apply to the master > and after I've tried to fix merge conflict, it segfaults at initdb. I'll rebase when I get some time to do it and post an updated version
On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova > <lubennikovaav@gmail.com> wrote: >> The following review has been posted through the commitfest application: >> make installcheck-world: tested, failed >> Implements feature: not tested >> Spec compliant: not tested >> Documentation: not tested >> >> Hi, >> I haven't read through the thread yet, just tried to apply the patch and run tests. >> And it seems that the last attached version is outdated now. It doesn't apply to the master >> and after I've tried to fix merge conflict, it segfaults at initdb. > > > I'll rebase when I get some time to do it and post an updated version Attached rebased patches. I called them both v3 to be consistent. I'm not sure how you ran it, but this works fine for me: ./configure --enable-debug --enable-cassert make clean make check ... after a while ... ======================= All 168 tests passed. ======================= I reckon the above is equivalent to installcheck, but doesn't require sudo or actually installing the server, so installcheck should work assuming the install went ok. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
22.12.2016 21:18, Claudio Freire: > On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova >> <lubennikovaav@gmail.com> wrote: >>> The following review has been posted through the commitfest application: >>> make installcheck-world: tested, failed >>> Implements feature: not tested >>> Spec compliant: not tested >>> Documentation: not tested >>> >>> Hi, >>> I haven't read through the thread yet, just tried to apply the patch and run tests. >>> And it seems that the last attached version is outdated now. It doesn't apply to the master >>> and after I've tried to fix merge conflict, it segfaults at initdb. >> >> I'll rebase when I get some time to do it and post an updated version > Attached rebased patches. I called them both v3 to be consistent. > > I'm not sure how you ran it, but this works fine for me: > > ./configure --enable-debug --enable-cassert > make clean > make check > > ... after a while ... > > ======================= > All 168 tests passed. > ======================= > > I reckon the above is equivalent to installcheck, but doesn't require > sudo or actually installing the server, so installcheck should work > assuming the install went ok. I found the reason. I configure postgres with CFLAGS="-O0" and it causes Segfault on initdb. It works fine and passes tests with default configure flags, but I'm pretty sure that we should fix segfault before testing the feature. If you need it, I'll send a core dump. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: >> On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com> >> wrote: >>> >>> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova >>> <lubennikovaav@gmail.com> wrote: >>>> >>>> The following review has been posted through the commitfest application: >>>> make installcheck-world: tested, failed >>>> Implements feature: not tested >>>> Spec compliant: not tested >>>> Documentation: not tested >>>> >>>> Hi, >>>> I haven't read through the thread yet, just tried to apply the patch and >>>> run tests. >>>> And it seems that the last attached version is outdated now. It doesn't >>>> apply to the master >>>> and after I've tried to fix merge conflict, it segfaults at initdb. >>> >>> >>> I'll rebase when I get some time to do it and post an updated version >> >> Attached rebased patches. I called them both v3 to be consistent. >> >> I'm not sure how you ran it, but this works fine for me: >> >> ./configure --enable-debug --enable-cassert >> make clean >> make check >> >> ... after a while ... >> >> ======================= >> All 168 tests passed. >> ======================= >> >> I reckon the above is equivalent to installcheck, but doesn't require >> sudo or actually installing the server, so installcheck should work >> assuming the install went ok. > > > I found the reason. I configure postgres with CFLAGS="-O0" and it causes > Segfault on initdb. > It works fine and passes tests with default configure flags, but I'm pretty > sure that we should fix segfault before testing the feature. > If you need it, I'll send a core dump. I just ran it with CFLAGS="-O0" and it passes all checks too: CFLAGS='-O0' ./configure --enable-debug --enable-cassert make clean && make -j8 && make check-world A stacktrace and a thorough description of your build environment would be helpful to understand why it breaks on your system.
23.12.2016 22:54, Claudio Freire:
I ran configure using the following set of flags:
./configure --enable-tap-tests --enable-cassert --enable-debug --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
And then ran make check. Here is the stacktrace:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417 tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
(gdb) bt
#0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
#1 0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9, vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
at vacuumlazy.c:1337
#2 0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9, params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
#3 0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9, params=0x7ffe0f866310) at vacuum.c:1418
#4 0x0000000000690122 in vacuum (options=9, relation=0x0, relid=0, params=0x7ffe0f866310, va_cols=0x0, bstrategy=0x1f1c4a8,
isTopLevel=1 '\001') at vacuum.c:320
#5 0x000000000068fd0b in vacuum (options=-1652367447, relation=0x0, relid=3324614038, params=0x1f11bf0, va_cols=0xb59f63,
bstrategy=0x1f1c620, isTopLevel=0 '\000') at vacuum.c:150
#6 0x0000000000852993 in standard_ProcessUtility (parsetree=0x1f07e60, queryString=0x1f07468 "VACUUM FREEZE;\n",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at utility.c:669
#7 0x00000000008520da in standard_ProcessUtility (parsetree=0x401ef6cd8, queryString=0x18 <error: Cannot access memory at address 0x18>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x68, dest=0x9e5d62 <AllocSetFree+60>, completionTag=0x7ffe0f8663f0 "`~\360\001")
at utility.c:360
#8 0x0000000000851161 in PortalRunMulti (portal=0x7ffe0f866750, isTopLevel=0 '\000', setHoldSnapshot=-39 '\331',
dest=0x851161 <PortalRunMulti+19>, altdest=0x7ffe0f8664f0, completionTag=0x1f07e60 "\341\002") at pquery.c:1219
#9 0x0000000000851374 in PortalRunMulti (portal=0x1f0a488, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xea5cc0 <debugtupDR>,
altdest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at pquery.c:1345
#10 0x0000000000850889 in PortalRun (portal=0x1f0a488, count=9223372036854775807, isTopLevel=1 '\001', dest=0xea5cc0 <debugtupDR>,
altdest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at pquery.c:824
#11 0x000000000084a4dc in exec_simple_query (query_string=0x1f07468 "VACUUM FREEZE;\n") at postgres.c:1113
#12 0x000000000084e960 in PostgresMain (argc=10, argv=0x1e60a50, dbname=0x1e823b0 "template1", username=0x1e672a0 "anastasia")
at postgres.c:4091
#13 0x00000000006f967e in init_locale (categoryname=0x100000000000000 <error: Cannot access memory at address 0x100000000000000>,
category=32766, locale=0xa004692f0 <error: Cannot access memory at address 0xa004692f0>) at main.c:310
#14 0x00007f1e5f463830 in __libc_start_main (main=0x6f93e1 <main+85>, argc=10, argv=0x7ffe0f866a78, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe0f866a68) at ../csu/libc-start.c:291
#15 0x0000000000469319 in _start ()
core file is quite big, so I didn't attach it to the mail. You can download it here: core dump file.
Here are some notes about the first patch:
1. prefetchBlkno = blkno & ~0x1f;
prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
I didn't get why we need these tricks. How does it differ from:
prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;
2. Why do we decrease prefetchBlkno twice?
Here:
+ prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
And here:
if (prefetchBlkno >= 32)
+ prefetchBlkno -= 32;
I'll inspect the second patch in a few days and write questions about it.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Anastasia Lubennikova wrote:

> I ran configure using following set of flags:
> ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417 tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

This doesn't make sense, since the patch removes the "tupindex"
variable in that function.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 27, 2016 at 10:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Anastasia Lubennikova wrote: > >> I ran configure using following set of flags: >> ./configure --enable-tap-tests --enable-cassert --enable-debug >> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer" >> And then ran make check. Here is the stacktrace: >> >> Program terminated with signal SIGSEGV, Segmentation fault. >> #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, >> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417 >> 1417 tblk = >> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]); > > This doesn't make sense, since the patch removes the "tupindex" > variable in that function. The variable is still there. It just has a slightly different meaning (index within the current segment, rather than global index). On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 23.12.2016 22:54, Claudio Freire: > > On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > I found the reason. I configure postgres with CFLAGS="-O0" and it causes > Segfault on initdb. > It works fine and passes tests with default configure flags, but I'm pretty > sure that we should fix segfault before testing the feature. > If you need it, I'll send a core dump. > > I just ran it with CFLAGS="-O0" and it passes all checks too: > > CFLAGS='-O0' ./configure --enable-debug --enable-cassert > make clean && make -j8 && make check-world > > A stacktrace and a thorough description of your build environment > would be helpful to understand why it breaks on your system. > > > I ran configure using following set of flags: > ./configure --enable-tap-tests --enable-cassert --enable-debug > --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer" > And then ran make check. Here is the stacktrace: Same procedure runs fine on my end. > core file is quite big, so I didn't attach it to the mail. You can download it here: core dump file. Can you attach your binary as well? (it needs to be identical to be able to inspect the core dump, and quite clearly my build is different) I'll keep looking for ways it could crash there, but being unable to reproduce the crash is a big hindrance, so if you can send the binary that could help speed things up. On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 1. prefetchBlkno = blkno & ~0x1f; > prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0; > > I didn't get it what for we need these tricks. How does it differ from: > prefetchBlkno = (blkno > 32) ? blkno - 32 : 0; It makes all prefetches ranges of 32 blocks aligned to 32-block boundaries. It helps since it's at 32 block boundaries that the truncate stops to check for locking conflicts and abort, guaranteeing no prefetch will be needless (if it goes into that code it *will* read the next 32 blocks). > 2. Why do we decrease prefetchBlckno twice? > > Here: > + prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0; > And here: > if (prefetchBlkno >= 32) > + prefetchBlkno -= 32; The first one is outside the loop, it's finding the first prefetch range that is boundary aligned, taking care not to cause underflow. The second one is inside the loop, it's moving the prefetch window down as the truncate moves along. Since it's already aligned, it doesn't need to be realigned, just clamped to zero if it happens to reach the bottom.
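To make the window arithmetic above concrete, here is a tiny standalone demonstration of the alignment and clamping. The 32-block window and the variable names mirror the snippets quoted in this thread; everything else, including the starting block number, is made up for illustration:

#include <stdio.h>

int main(void)
{
    unsigned int blkno = 100;       /* hypothetical truncation start point */
    unsigned int prefetchBlkno;

    /* Outside the loop: round down to a 32-block boundary, then step one
     * window lower, guarding against underflow. */
    prefetchBlkno = blkno & ~0x1f;                                  /* 96 */
    prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;  /* 64 */

    /* Inside the loop: when the backward scan reaches the window, issue the
     * prefetch and slide the (already aligned) window down, clamped at 0. */
    while (blkno > 0)
    {
        blkno--;
        if (blkno == prefetchBlkno)
        {
            printf("prefetch blocks %u..%u\n", prefetchBlkno, prefetchBlkno + 31);
            if (prefetchBlkno >= 32)
                prefetchBlkno -= 32;
            else
                prefetchBlkno = 0;
        }
    }
    return 0;
}

Every window it prints is aligned to a 32-block boundary, matching the points at which the truncation scan stops to check for lock conflicts.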
On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Program terminated with signal SIGSEGV, Segmentation fault. > #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, > vacrelstats=0x1ef6e00) at vacuumlazy.c:1417 > 1417 tblk = > ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]); > (gdb) bt > #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, > vacrelstats=0x1ef6e00) at vacuumlazy.c:1417 > #1 0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9, > vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001') > at vacuumlazy.c:1337 > #2 0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9, > params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290 > #3 0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9, > params=0x7ffe0f866310) at vacuum.c:1418 Those line numbers don't match my code. Which commit are you based on? My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610
27.12.2016 20:14, Claudio Freire:
Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
I merely applied patches 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch
and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
then ran configure and make as usual.
Am I doing something wrong?
Anyway, I found the problem that caused the segfault.
for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
{
DeadTuplesSegment *seg = &(vacrelstats->dead_tuples.dead_tuples[segindex]);
int num_dead_tuples = seg->num_dead_tuples;
while (tupindex < num_dead_tuples)
...
You rely on the value of tupindex here, but during the very first pass the 'tupindex' variable
may contain any garbage. And it happened that on my system it held a negative value,
as I found by inspecting the core dump:
(gdb) info locals
num_dead_tuples = 5
tottuples = 0
tupindex = -1819017215
Which leads to failure in the next line
tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
The solution is to move this assignment inside the loop.
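For clarity, the fixed loop would look roughly like this (a sketch based on the snippet above, not necessarily the exact code that ends up in the next patch version):

for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; segindex++)
{
    DeadTuplesSegment *seg = &(vacrelstats->dead_tuples.dead_tuples[segindex]);
    int num_dead_tuples = seg->num_dead_tuples;

    tupindex = 0;    /* reset here, so the first pass never sees garbage */
    while (tupindex < num_dead_tuples)
    ...
}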
I've read the second patch.
1. What is the reason to inline vac_cmp_itemptr() ?
2. +#define LAZY_MIN_TUPLES Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with a named constant,
or at least add a comment.
I'll share my results of performance testing it in a few days.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Wed, Dec 28, 2016 at 10:26 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 27.12.2016 20:14, Claudio Freire: > > On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > Program terminated with signal SIGSEGV, Segmentation fault. > #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, > vacrelstats=0x1ef6e00) at vacuumlazy.c:1417 > 1417 tblk = > ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]); > (gdb) bt > #0 0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, > vacrelstats=0x1ef6e00) at vacuumlazy.c:1417 > #1 0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9, > vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001') > at vacuumlazy.c:1337 > #2 0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9, > params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290 > #3 0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9, > params=0x7ffe0f866310) at vacuum.c:1418 > > Those line numbers don't match my code. > > Which commit are you based on? > > My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610 > > > Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well. > I merely applied patches > 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch > and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch > then ran configure and make as usual. > Am I doing something wrong? Doesn't sound wrong. Maybe it's my tree the unclean one. I'll have to do a clean checkout to verify. > Anyway, I found the problem that had caused segfault. > > for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = > 0, segindex++) > { > DeadTuplesSegment *seg = > &(vacrelstats->dead_tuples.dead_tuples[segindex]); > int num_dead_tuples = seg->num_dead_tuples; > > while (tupindex < num_dead_tuples) > ... > > You rely on the value of tupindex here, while during the very first pass the > 'tupindex' variable > may contain any garbage. And it happend that on my system there was negative > value > as I found inspecting core dump: > > (gdb) info locals > num_dead_tuples = 5 > tottuples = 0 > tupindex = -1819017215 > > Which leads to failure in the next line > tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]); > > The solution is to move this assignment inside the cycle. Good catch. I read that line suspecting that very same thing but somehow I was blind to it. > I've read the second patch. > > 1. What is the reason to inline vac_cmp_itemptr() ? Performance, mostly. By inlining some transformations can be applied that wouldn't be possible otherwise. During the binary search, this matters performance-wise. > 2. +#define LAZY_MIN_TUPLES Max(MaxHeapTuplesPerPage, (128<<20) / > sizeof(ItemPointerData)) > What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with named > constant, > or at least add a comment. I tought it was more readable like that, since 1<<20 is known to be "MB", that reads as "128 MB". I guess I can add a comment saying so. > I'll share my results of performance testing it in a few days. Thanks
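For reference, the comment being asked for might end up looking something like this; only the macro itself comes from the patch, the comment wording is just an illustration:

/*
 * Initial (minimum) size of a dead-tuple segment: 128MB worth of
 * ItemPointers (128<<20 bytes), but never less than one heap page's
 * worth of tuples.
 */
#define LAZY_MIN_TUPLES  Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))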
On Wed, Dec 28, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Anyway, I found the problem that had caused segfault. >> >> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = >> 0, segindex++) >> { >> DeadTuplesSegment *seg = >> &(vacrelstats->dead_tuples.dead_tuples[segindex]); >> int num_dead_tuples = seg->num_dead_tuples; >> >> while (tupindex < num_dead_tuples) >> ... >> >> You rely on the value of tupindex here, while during the very first pass the >> 'tupindex' variable >> may contain any garbage. And it happend that on my system there was negative >> value >> as I found inspecting core dump: >> >> (gdb) info locals >> num_dead_tuples = 5 >> tottuples = 0 >> tupindex = -1819017215 >> >> Which leads to failure in the next line >> tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]); >> >> The solution is to move this assignment inside the cycle. > > Good catch. I read that line suspecting that very same thing but > somehow I was blind to it. Attached v4 patches with the requested fixes. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
28.12.2016 23:43, Claudio Freire:
Attached v4 patches with the requested fixes.
Sorry for being late, but the tests took a lot of time.
create table t1 as select i, md5(random()::text) from generate_series(0,400000000) as i;
create index md5_idx ON t1(md5);
update t1 set md5 = md5((random() * (100 + 500))::text);
vacuum;

Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
while the old version took three passes (1GB+1GB+0.9GB).
Vacuum duration results:
vanilla:
LOG: duration: 4359006.327 ms statement: vacuum verbose t1;
patched:
LOG: duration: 3076827.378 ms statement: vacuum verbose t1;

We can see a 30% vacuum speedup. I should note that this case can be considered
favorable to vanilla vacuum: the table is not that big, it has just one index,
and the disk used is a fast fusionIO. We can expect even more gain on slower disks.
Thank you again for the patch. Hope to see it in 10.0.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 28.12.2016 23:43, Claudio Freire: > > Attached v4 patches with the requested fixes. > > > Sorry for being late, but the tests took a lot of time. I know. Takes me several days to run my test scripts once. > create table t1 as select i, md5(random()::text) from > generate_series(0,400000000) as i; > create index md5_idx ON t1(md5); > update t1 set md5 = md5((random() * (100 + 500))::text); > vacuum; > > Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass, > while for old version it took three passes (1GB+1GB+0.9GB). > Vacuum duration results: > > vanilla: > LOG: duration: 4359006.327 ms statement: vacuum verbose t1; > patched: > LOG: duration: 3076827.378 ms statement: vacuum verbose t1; > > We can see 30% vacuum speedup. I should note that this case can be > considered > as favorable to vanilla vacuum: the table is not that big, it has just one > index > and disk used is a fast fusionIO. We can expect even more gain on slower > disks. > > Thank you again for the patch. Hope to see it in 10.0. Cool. Thanks for the review and the tests.
On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> 28.12.2016 23:43, Claudio Freire: >> >> Attached v4 patches with the requested fixes. >> >> >> Sorry for being late, but the tests took a lot of time. > > I know. Takes me several days to run my test scripts once. > >> create table t1 as select i, md5(random()::text) from >> generate_series(0,400000000) as i; >> create index md5_idx ON t1(md5); >> update t1 set md5 = md5((random() * (100 + 500))::text); >> vacuum; >> >> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass, >> while for old version it took three passes (1GB+1GB+0.9GB). >> Vacuum duration results: >> >> vanilla: >> LOG: duration: 4359006.327 ms statement: vacuum verbose t1; >> patched: >> LOG: duration: 3076827.378 ms statement: vacuum verbose t1; >> >> We can see 30% vacuum speedup. I should note that this case can be >> considered >> as favorable to vanilla vacuum: the table is not that big, it has just one >> index >> and disk used is a fast fusionIO. We can expect even more gain on slower >> disks. >> >> Thank you again for the patch. Hope to see it in 10.0. > > Cool. Thanks for the review and the tests. > I encountered a bug with following scenario. 1. Create table and disable autovacuum on that table. 2. Make about 200000 dead tuples on the table. 3. SET maintenance_work_mem TO 1024 4. VACUUM @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, * not to reset latestRemovedXid since we want that value to be * valid. */ - vacrelstats->num_dead_tuples = 0; + lazy_clear_dead_tuples(vacrelstats); vacrelstats->num_index_scans++; /* Report that we are once again scanning the heap */ I think that we should do vacrelstats->dead_tuples.num_entries = 0 as well in lazy_clear_dead_tuples(). Once the amount of dead tuples reached to maintenance_work_mem, lazy_scan_heap can never finish. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
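A sketch of the reset being suggested might look like this; the field names are taken from the snippets quoted in this thread, and the surrounding code is an assumption rather than the patch's actual function:

static void
lazy_clear_dead_tuples(LVRelStats *vacrelstats)
{
    int     i;

    /* reset every segment that has been filled so far */
    for (i = 0; i <= vacrelstats->dead_tuples.last_seg; i++)
        vacrelstats->dead_tuples.dead_tuples[i].num_dead_tuples = 0;

    vacrelstats->dead_tuples.last_seg = 0;
    /* ... and the global counter, as pointed out above */
    vacrelstats->dead_tuples.num_entries = 0;
}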
You posted two patches with this preamble: Claudio Freire wrote: > Attached is the raw output of the test, the script used to create it, > and just in case the patch set used. I believe it's the same as the > last one I posted, just rebased. There was no discussion whatsoever of the "prefetch" patch in this thread; and as far as I can see, nobody even mentioned such an idea in the thread. This prefetch patch appeared out of the blue and there was no discussion about it that I can see. Now I was about to push it after some minor tweaks, and went to search where was its justification, only to see that there was none. Did anybody run tests with this patch? I attach it now one more time. My version is based on the latest Claudio posted at https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=Syv9GnGFcJK8uZb2fR6NDDqULaw@mail.gmail.com I don't know if there are differences to the version first posted. I only changed the magic number 32 to a #define, and added a CHECK_FOR_INTERRUPTS in the prefetching loop. FWIW, I think this patch is completely separate from the maint_work_mem patch and should have had its own thread and its own commitfest entry. I intend to get a look at the other patch next week, after pushing this one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Alvaro Herrera wrote: > There was no discussion whatsoever of the "prefetch" patch in this > thread; and as far as I can see, nobody even mentioned such an idea in > the thread. This prefetch patch appeared out of the blue and there was > no discussion about it that I can see. Now I was about to push it after > some minor tweaks, and went to search where was its justification, only > to see that there was none. Did anybody run tests with this patch? > > I attach it now one more time. My version is based on the latest > Claudio posted at > https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=Syv9GnGFcJK8uZb2fR6NDDqULaw@mail.gmail.com > I don't know if there are differences to the version first posted. > I only changed the magic number 32 to a #define, and added a > CHECK_FOR_INTERRUPTS in the prefetching loop. Ah, I found the justification here: https://www.postgresql.org/message-id/flat/CAGTBQpa464RugxYwxLTtDi%3DSyv9GnGFcJK8uZb2fR6NDDqULaw%40mail.gmail.com#CAGTBQpbayY-t5-ySW19yQs1dBqvV6dm8dmdpTv_FWXmDC0A0cQ%40mail.gmail.com apparently the truncate scan is 4x-6x faster with this prefetching. Nice! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I pushed this patch after rewriting it rather completely. I added tracing notices to inspect the blocks it was prefetching and observed that the original coding was failing to prefetch the final streak of blocks in the table, which is an important oversight considering that it may very well be that those are the only blocks to read at all. I timed vacuuming a 4000-block table in my laptop (single SSD disk; dropped FS caches after deleting all rows in table, so that vacuum has to read all blocks from disk); it changes from 387ms without patch to 155ms with patch. I didn't measure how much it takes to run the other steps in the vacuum, but it's clear that the speedup for the truncation phase is considerable. ¡Thanks, Claudio! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I think this patch no longer applies because of conflicts with the one I just pushed. Please rebase. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova >> <a.lubennikova@postgrespro.ru> wrote: >>> 28.12.2016 23:43, Claudio Freire: >>> >>> Attached v4 patches with the requested fixes. >>> >>> >>> Sorry for being late, but the tests took a lot of time. >> >> I know. Takes me several days to run my test scripts once. >> >>> create table t1 as select i, md5(random()::text) from >>> generate_series(0,400000000) as i; >>> create index md5_idx ON t1(md5); >>> update t1 set md5 = md5((random() * (100 + 500))::text); >>> vacuum; >>> >>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass, >>> while for old version it took three passes (1GB+1GB+0.9GB). >>> Vacuum duration results: >>> >>> vanilla: >>> LOG: duration: 4359006.327 ms statement: vacuum verbose t1; >>> patched: >>> LOG: duration: 3076827.378 ms statement: vacuum verbose t1; >>> >>> We can see 30% vacuum speedup. I should note that this case can be >>> considered >>> as favorable to vanilla vacuum: the table is not that big, it has just one >>> index >>> and disk used is a fast fusionIO. We can expect even more gain on slower >>> disks. >>> >>> Thank you again for the patch. Hope to see it in 10.0. >> >> Cool. Thanks for the review and the tests. >> > > I encountered a bug with following scenario. > 1. Create table and disable autovacuum on that table. > 2. Make about 200000 dead tuples on the table. > 3. SET maintenance_work_mem TO 1024 > 4. VACUUM > > @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options, > LVRelStats *vacrelstats, > * not to reset latestRemovedXid since we want > that value to be > * valid. > */ > - vacrelstats->num_dead_tuples = 0; > + lazy_clear_dead_tuples(vacrelstats); > vacrelstats->num_index_scans++; > > /* Report that we are once again scanning the heap */ > > I think that we should do vacrelstats->dead_tuples.num_entries = 0 as > well in lazy_clear_dead_tuples(). Once the amount of dead tuples > reached to maintenance_work_mem, lazy_scan_heap can never finish. That's right. I added a test for it in the attached patch set, which uncovered another bug in lazy_clear_dead_tuples, and took the opportunity to rebase. On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I pushed this patch after rewriting it rather completely. I added > tracing notices to inspect the blocks it was prefetching and observed > that the original coding was failing to prefetch the final streak of > blocks in the table, which is an important oversight considering that it > may very well be that those are the only blocks to read at all. > > I timed vacuuming a 4000-block table in my laptop (single SSD disk; > dropped FS caches after deleting all rows in table, so that vacuum has > to read all blocks from disk); it changes from 387ms without patch to > 155ms with patch. I didn't measure how much it takes to run the other > steps in the vacuum, but it's clear that the speedup for the truncation > phase is considerable. > > ĄThanks, Claudio! Cool. Though it wasn't the first time this idea has been floating around, I can't take all the credit. On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > FWIW, I think this patch is completely separate from the maint_work_mem > patch and should have had its own thread and its own commitfest entry. 
> I intend to get a look at the other patch next week, after pushing this > one. That's because it did have it, and was left in limbo due to lack of testing on SSDs. I just had to adopt it here because otherwise tests took way too long. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Jan 24, 2017 at 1:49 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova >>> <a.lubennikova@postgrespro.ru> wrote: >>>> 28.12.2016 23:43, Claudio Freire: >>>> >>>> Attached v4 patches with the requested fixes. >>>> >>>> >>>> Sorry for being late, but the tests took a lot of time. >>> >>> I know. Takes me several days to run my test scripts once. >>> >>>> create table t1 as select i, md5(random()::text) from >>>> generate_series(0,400000000) as i; >>>> create index md5_idx ON t1(md5); >>>> update t1 set md5 = md5((random() * (100 + 500))::text); >>>> vacuum; >>>> >>>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass, >>>> while for old version it took three passes (1GB+1GB+0.9GB). >>>> Vacuum duration results: >>>> >>>> vanilla: >>>> LOG: duration: 4359006.327 ms statement: vacuum verbose t1; >>>> patched: >>>> LOG: duration: 3076827.378 ms statement: vacuum verbose t1; >>>> >>>> We can see 30% vacuum speedup. I should note that this case can be >>>> considered >>>> as favorable to vanilla vacuum: the table is not that big, it has just one >>>> index >>>> and disk used is a fast fusionIO. We can expect even more gain on slower >>>> disks. >>>> >>>> Thank you again for the patch. Hope to see it in 10.0. >>> >>> Cool. Thanks for the review and the tests. >>> >> >> I encountered a bug with following scenario. >> 1. Create table and disable autovacuum on that table. >> 2. Make about 200000 dead tuples on the table. >> 3. SET maintenance_work_mem TO 1024 >> 4. VACUUM >> >> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options, >> LVRelStats *vacrelstats, >> * not to reset latestRemovedXid since we want >> that value to be >> * valid. >> */ >> - vacrelstats->num_dead_tuples = 0; >> + lazy_clear_dead_tuples(vacrelstats); >> vacrelstats->num_index_scans++; >> >> /* Report that we are once again scanning the heap */ >> >> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as >> well in lazy_clear_dead_tuples(). Once the amount of dead tuples >> reached to maintenance_work_mem, lazy_scan_heap can never finish. > > That's right. > > I added a test for it in the attached patch set, which uncovered > another bug in lazy_clear_dead_tuples, and took the opportunity to > rebase. > > On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I pushed this patch after rewriting it rather completely. I added >> tracing notices to inspect the blocks it was prefetching and observed >> that the original coding was failing to prefetch the final streak of >> blocks in the table, which is an important oversight considering that it >> may very well be that those are the only blocks to read at all. >> >> I timed vacuuming a 4000-block table in my laptop (single SSD disk; >> dropped FS caches after deleting all rows in table, so that vacuum has >> to read all blocks from disk); it changes from 387ms without patch to >> 155ms with patch. I didn't measure how much it takes to run the other >> steps in the vacuum, but it's clear that the speedup for the truncation >> phase is considerable. >> >> ĄThanks, Claudio! > > Cool. > > Though it wasn't the first time this idea has been floating around, I > can't take all the credit. 
> > > On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> FWIW, I think this patch is completely separate from the maint_work_mem >> patch and should have had its own thread and its own commitfest entry. >> I intend to get a look at the other patch next week, after pushing this >> one. > > That's because it did have it, and was left in limbo due to lack of > testing on SSDs. I just had to adopt it here because otherwise tests > took way too long. Thank you for updating the patch! + /* + * Quickly rule out by lower bound (should happen a lot) Upper bound was + * already checked by segment search + */ + if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0) + return false; I think that if the above result is 0, we can return true as itemptr matched lower bound item pointer in rseg->dead_tuples. +typedef struct DeadTuplesSegment +{ + int num_dead_tuples; /* # of entries in the segment */ + int max_dead_tuples; /* # of entries allocated in the segment */ + ItemPointerData last_dead_tuple; /* Copy of the last dead tuple (unset + * until the segment is fully + * populated) */ + unsigned short padding; + ItemPointer dead_tuples; /* Array of dead tuples */ +} DeadTuplesSegment; + +typedef struct DeadTuplesMultiArray +{ + int num_entries; /* current # of entries */ + int max_entries; /* total # of slots that can be allocated in + * array */ + int num_segs; /* number of dead tuple segments allocated */ + int last_seg; /* last dead tuple segment with data (or 0) */ + DeadTuplesSegment *dead_tuples; /* array of num_segs segments */ +} DeadTuplesMultiArray; It's a matter of personal preference but some same dead_tuples variables having different meaning confused me. If we want to access first dead tuple location of first segment, we need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For example, 'vacrelstats->dead_tuples.dt_segment.dt_array' is better to me. + nseg->num_dead_tuples = 0; + nseg->max_dead_tuples = 0; + nseg->dead_tuples = NULL; + vacrelstats->dead_tuples.num_segs++; + } + seg = DeadTuplesCurrentSegment(vacrelstats); + } + vacrelstats->dead_tuples.last_seg++; + seg = DeadTuplesCurrentSegment(vacrelstats); Because seg is always set later I think the first line starting with "seg = ..." is not necessary. Thought? -- Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for updating the patch! > > + /* > + * Quickly rule out by lower bound (should happen a lot) Upper bound was > + * already checked by segment search > + */ > + if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0) > + return false; > > I think that if the above result is 0, we can return true as itemptr > matched lower bound item pointer in rseg->dead_tuples. That's right. Possibly not a great speedup but... why not? > > +typedef struct DeadTuplesSegment > +{ > + int num_dead_tuples; /* # of > entries in the segment */ > + int max_dead_tuples; /* # of > entries allocated in the segment */ > + ItemPointerData last_dead_tuple; /* Copy of the last > dead tuple (unset > + > * until the segment is fully > + > * populated) */ > + unsigned short padding; > + ItemPointer dead_tuples; /* Array of dead tuples */ > +} DeadTuplesSegment; > + > +typedef struct DeadTuplesMultiArray > +{ > + int num_entries; /* current # of entries */ > + int max_entries; /* total # of slots > that can be allocated in > + * array */ > + int num_segs; /* number of > dead tuple segments allocated */ > + int last_seg; /* last dead > tuple segment with data (or 0) */ > + DeadTuplesSegment *dead_tuples; /* array of num_segs segments */ > +} DeadTuplesMultiArray; > > It's a matter of personal preference but some same dead_tuples > variables having different meaning confused me. > If we want to access first dead tuple location of first segment, we > need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For > example, 'vacrelstats->dead_tuples.dt_segment.dt_array' is better to > me. Yes, I can see how that could be confusing. I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j] > + nseg->num_dead_tuples = 0; > + nseg->max_dead_tuples = 0; > + nseg->dead_tuples = NULL; > + vacrelstats->dead_tuples.num_segs++; > + } > + seg = DeadTuplesCurrentSegment(vacrelstats); > + } > + vacrelstats->dead_tuples.last_seg++; > + seg = DeadTuplesCurrentSegment(vacrelstats); > > Because seg is always set later I think the first line starting with > "seg = ..." is not necessary. Thought? That's correct. Attached a v6 with those changes (and rebased). Make check still passes. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Jan 26, 2017 at 5:11 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Thank you for updating the patch! >> >> + /* >> + * Quickly rule out by lower bound (should happen a lot) Upper bound was >> + * already checked by segment search >> + */ >> + if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0) >> + return false; >> >> I think that if the above result is 0, we can return true as itemptr >> matched lower bound item pointer in rseg->dead_tuples. > > That's right. Possibly not a great speedup but... why not? > >> >> +typedef struct DeadTuplesSegment >> +{ >> + int num_dead_tuples; /* # of >> entries in the segment */ >> + int max_dead_tuples; /* # of >> entries allocated in the segment */ >> + ItemPointerData last_dead_tuple; /* Copy of the last >> dead tuple (unset >> + >> * until the segment is fully >> + >> * populated) */ >> + unsigned short padding; >> + ItemPointer dead_tuples; /* Array of dead tuples */ >> +} DeadTuplesSegment; >> + >> +typedef struct DeadTuplesMultiArray >> +{ >> + int num_entries; /* current # of entries */ >> + int max_entries; /* total # of slots >> that can be allocated in >> + * array */ >> + int num_segs; /* number of >> dead tuple segments allocated */ >> + int last_seg; /* last dead >> tuple segment with data (or 0) */ >> + DeadTuplesSegment *dead_tuples; /* array of num_segs segments */ >> +} DeadTuplesMultiArray; >> >> It's a matter of personal preference but some same dead_tuples >> variables having different meaning confused me. >> If we want to access first dead tuple location of first segment, we >> need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For >> example, 'vacrelstats->dead_tuples.dt_segment.dt_array' is better to >> me. > > Yes, I can see how that could be confusing. > > I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j] Thank you for updating. Looks good to me. >> + nseg->num_dead_tuples = 0; >> + nseg->max_dead_tuples = 0; >> + nseg->dead_tuples = NULL; >> + vacrelstats->dead_tuples.num_segs++; >> + } >> + seg = DeadTuplesCurrentSegment(vacrelstats); >> + } >> + vacrelstats->dead_tuples.last_seg++; >> + seg = DeadTuplesCurrentSegment(vacrelstats); >> >> Because seg is always set later I think the first line starting with >> "seg = ..." is not necessary. Thought? > > That's correct. > > Attached a v6 with those changes (and rebased). > > Make check still passes. Here is review comment of v6 patch. ----* We are willing to use at most maintenance_work_mem (or perhaps* autovacuum_work_mem) memory space to keep track ofdead tuples. We* initially allocate an array of TIDs of that size, with an upper limit that* depends on table size (thislimit ensures we don't allocate a huge area* uselessly for vacuuming small tables). If the array threatens to overflow, I think that we need to update the above paragraph comment at top of vacuumlazy.c file. ---- + numtuples = Max(numtuples, MaxHeapTuplesPerPage); + numtuples = Min(numtuples, INT_MAX / 2); + numtuples = Min(numtuples, 2 * pseg->max_dead_tuples); + numtuples = Min(numtuples, MaxAllocSize / sizeof(ItemPointerData)); + seg->dt_tids = (ItemPointer) palloc(sizeof(ItemPointerData) * numtuples); Why numtuples is limited to "INT_MAX / 2" but not INT_MAX? 
---- @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) pg_rusage_init(&ru0); npages = 0; - tupindex = 0; - while (tupindex < vacrelstats->num_dead_tuples) + segindex = 0; + tottuples = 0; + for (segindex = tupindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++) { - BlockNumber tblk; - Buffer buf; - Page page; - Size freespace; This is a minute thing but tupindex can be define inside of for loop. ---- @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, * instead of doing a second scan. */ if (nindexes == 0 && - vacrelstats->num_dead_tuples > 0) + vacrelstats->dead_tuples.num_entries > 0) { /* Remove tuples from heap */ - lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer); + Assert(vacrelstats->dead_tuples.last_seg == 0); /* Should not need more + * than one segment per + * page */ I'm not sure we need to add Assert() here but it seems to me that the comment and code is not properly correspond and the comment for Assert() should be wrote above of Assert() line. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > ---- > * We are willing to use at most maintenance_work_mem (or perhaps > * autovacuum_work_mem) memory space to keep track of dead tuples. We > * initially allocate an array of TIDs of that size, with an upper limit that > * depends on table size (this limit ensures we don't allocate a huge area > * uselessly for vacuuming small tables). If the array threatens to overflow, > > I think that we need to update the above paragraph comment at top of > vacuumlazy.c file. Indeed, I missed that one. Fixing. > > ---- > + numtuples = Max(numtuples, > MaxHeapTuplesPerPage); > + numtuples = Min(numtuples, INT_MAX / 2); > + numtuples = Min(numtuples, 2 * > pseg->max_dead_tuples); > + numtuples = Min(numtuples, > MaxAllocSize / sizeof(ItemPointerData)); > + seg->dt_tids = (ItemPointer) > palloc(sizeof(ItemPointerData) * numtuples); > > Why numtuples is limited to "INT_MAX / 2" but not INT_MAX? I forgot to mention this one in the OP. Googling around, I found out some implemetations of bsearch break with array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the midpoint). Before this patch, this bsearch call had no way of reaching that size. An initial version of the patch (the one that allocated a big array with huge allocation) could reach that point, though, so I reduced the limit to play it safe. This latest version is back to the starting point, since it cannot allocate segments bigger than 1GB, but I opted to keep playing it safe and leave the reduced limit just in case. > ---- > @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats > *vacrelstats) > pg_rusage_init(&ru0); > npages = 0; > > - tupindex = 0; > - while (tupindex < vacrelstats->num_dead_tuples) > + segindex = 0; > + tottuples = 0; > + for (segindex = tupindex = 0; segindex <= > vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++) > { > - BlockNumber tblk; > - Buffer buf; > - Page page; > - Size freespace; > > This is a minute thing but tupindex can be define inside of for loop. Right, changing. > > ---- > @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options, > LVRelStats *vacrelstats, > * instead of doing a second scan. > */ > if (nindexes == 0 && > - vacrelstats->num_dead_tuples > 0) > + vacrelstats->dead_tuples.num_entries > 0) > { > /* Remove tuples from heap */ > - lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer); > + Assert(vacrelstats->dead_tuples.last_seg == 0); /* > Should not need more > + * > than one segment per > + * page */ > > I'm not sure we need to add Assert() here but it seems to me that the > comment and code is not properly correspond and the comment for > Assert() should be wrote above of Assert() line. Well, that assert is the one that found the second bug in lazy_clear_dead_tuples, so clearly it's not without merit. I'll rearrange the comments as you ask though. Updated and rebased v7 attached. [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
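As an aside, the midpoint overflow referred to above is the classic binary-search bug; a minimal illustration (not code from the patch or from any libc):

int mid_buggy = (lo + hi) / 2;       /* can overflow once lo + hi exceeds INT_MAX */
int mid_safe  = lo + (hi - lo) / 2;  /* never overflows for 0 <= lo <= hi */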
On Tue, Jan 31, 2017 at 11:05 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > Updated and rebased v7 attached. Moved to CF 2017-03. -- Michael
On Tue, Jan 31, 2017 at 3:05 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> ---- >> * We are willing to use at most maintenance_work_mem (or perhaps >> * autovacuum_work_mem) memory space to keep track of dead tuples. We >> * initially allocate an array of TIDs of that size, with an upper limit that >> * depends on table size (this limit ensures we don't allocate a huge area >> * uselessly for vacuuming small tables). If the array threatens to overflow, >> >> I think that we need to update the above paragraph comment at top of >> vacuumlazy.c file. > > Indeed, I missed that one. Fixing. > >> >> ---- >> + numtuples = Max(numtuples, >> MaxHeapTuplesPerPage); >> + numtuples = Min(numtuples, INT_MAX / 2); >> + numtuples = Min(numtuples, 2 * >> pseg->max_dead_tuples); >> + numtuples = Min(numtuples, >> MaxAllocSize / sizeof(ItemPointerData)); >> + seg->dt_tids = (ItemPointer) >> palloc(sizeof(ItemPointerData) * numtuples); >> >> Why numtuples is limited to "INT_MAX / 2" but not INT_MAX? > > I forgot to mention this one in the OP. > > Googling around, I found out some implemetations of bsearch break with > array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the > midpoint). > > Before this patch, this bsearch call had no way of reaching that size. > An initial version of the patch (the one that allocated a big array > with huge allocation) could reach that point, though, so I reduced the > limit to play it safe. This latest version is back to the starting > point, since it cannot allocate segments bigger than 1GB, but I opted > to keep playing it safe and leave the reduced limit just in case. > Thanks, I understood. >> ---- >> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats >> *vacrelstats) >> pg_rusage_init(&ru0); >> npages = 0; >> >> - tupindex = 0; >> - while (tupindex < vacrelstats->num_dead_tuples) >> + segindex = 0; >> + tottuples = 0; >> + for (segindex = tupindex = 0; segindex <= >> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++) >> { >> - BlockNumber tblk; >> - Buffer buf; >> - Page page; >> - Size freespace; >> >> This is a minute thing but tupindex can be define inside of for loop. > > Right, changing. > >> >> ---- >> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options, >> LVRelStats *vacrelstats, >> * instead of doing a second scan. >> */ >> if (nindexes == 0 && >> - vacrelstats->num_dead_tuples > 0) >> + vacrelstats->dead_tuples.num_entries > 0) >> { >> /* Remove tuples from heap */ >> - lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer); >> + Assert(vacrelstats->dead_tuples.last_seg == 0); /* >> Should not need more >> + * >> than one segment per >> + * page */ >> >> I'm not sure we need to add Assert() here but it seems to me that the >> comment and code is not properly correspond and the comment for >> Assert() should be wrote above of Assert() line. > > Well, that assert is the one that found the second bug in > lazy_clear_dead_tuples, so clearly it's not without merit. > > I'll rearrange the comments as you ask though. > > > Updated and rebased v7 attached. > > > [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671 Thank you for updating the patch. Whole patch looks good to me except for the following one comment. This is the final comment from me. 
/* * lazy_tid_reaped() -- is a particular tid deletable? * * This has the right signature to be an IndexBulkDeleteCallback. * * Assumes dead_tuples array is in sorted order. */ static bool lazy_tid_reaped(ItemPointer itemptr, void *state) { LVRelStats *vacrelstats = (LVRelStats *) state; You might want to update the comment of lazy_tid_reaped() as well. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for updating the patch. > > Whole patch looks good to me except for the following one comment. > This is the final comment from me. > > /* > * lazy_tid_reaped() -- is a particular tid deletable? > * > * This has the right signature to be an IndexBulkDeleteCallback. > * > * Assumes dead_tuples array is in sorted order. > */ > static bool > lazy_tid_reaped(ItemPointer itemptr, void *state) > { > LVRelStats *vacrelstats = (LVRelStats *) state; > > You might want to update the comment of lazy_tid_reaped() as well. I don't see the mismatch with reality there (if you consider "dead_tuples array" in the proper context, that is, the multiarray). What in particular do you find out of sync there?
On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Thank you for updating the patch. >> >> Whole patch looks good to me except for the following one comment. >> This is the final comment from me. >> >> /* >> * lazy_tid_reaped() -- is a particular tid deletable? >> * >> * This has the right signature to be an IndexBulkDeleteCallback. >> * >> * Assumes dead_tuples array is in sorted order. >> */ >> static bool >> lazy_tid_reaped(ItemPointer itemptr, void *state) >> { >> LVRelStats *vacrelstats = (LVRelStats *) state; >> >> You might want to update the comment of lazy_tid_reaped() as well. > > I don't see the mismatch with reality there (if you consider > "dead_tples array" in the proper context, that is, the multiarray). > > What in particular do you find out of sync there? The current lazy_tid_reaped just find a tid from a tid array using bsearch but in your patch lazy_tid_reaped handles multiple tid arrays and processing method become complicated. So I thought it's better to add the description of this function. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Thank you for updating the patch. >>> >>> Whole patch looks good to me except for the following one comment. >>> This is the final comment from me. >>> >>> /* >>> * lazy_tid_reaped() -- is a particular tid deletable? >>> * >>> * This has the right signature to be an IndexBulkDeleteCallback. >>> * >>> * Assumes dead_tuples array is in sorted order. >>> */ >>> static bool >>> lazy_tid_reaped(ItemPointer itemptr, void *state) >>> { >>> LVRelStats *vacrelstats = (LVRelStats *) state; >>> >>> You might want to update the comment of lazy_tid_reaped() as well. >> >> I don't see the mismatch with reality there (if you consider >> "dead_tples array" in the proper context, that is, the multiarray). >> >> What in particular do you find out of sync there? > > The current lazy_tid_reaped just find a tid from a tid array using > bsearch but in your patch lazy_tid_reaped handles multiple tid arrays > and processing method become complicated. So I thought it's better to > add the description of this function. Alright, updated with some more remarks that seemed relevant -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Feb 1, 2017 at 11:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Thank you for updating the patch. >>>> >>>> Whole patch looks good to me except for the following one comment. >>>> This is the final comment from me. >>>> >>>> /* >>>> * lazy_tid_reaped() -- is a particular tid deletable? >>>> * >>>> * This has the right signature to be an IndexBulkDeleteCallback. >>>> * >>>> * Assumes dead_tuples array is in sorted order. >>>> */ >>>> static bool >>>> lazy_tid_reaped(ItemPointer itemptr, void *state) >>>> { >>>> LVRelStats *vacrelstats = (LVRelStats *) state; >>>> >>>> You might want to update the comment of lazy_tid_reaped() as well. >>> >>> I don't see the mismatch with reality there (if you consider >>> "dead_tples array" in the proper context, that is, the multiarray). >>> >>> What in particular do you find out of sync there? >> >> The current lazy_tid_reaped just find a tid from a tid array using >> bsearch but in your patch lazy_tid_reaped handles multiple tid arrays >> and processing method become complicated. So I thought it's better to >> add the description of this function. > > Alright, updated with some more remarks that seemed relevant Thank you for updating the patch. The patch looks good to me. There is no review comment from me. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 1, 2017 at 7:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Thank you for updating the patch. >>>> >>>> Whole patch looks good to me except for the following one comment. >>>> This is the final comment from me. >>>> >>>> /* >>>> * lazy_tid_reaped() -- is a particular tid deletable? >>>> * >>>> * This has the right signature to be an IndexBulkDeleteCallback. >>>> * >>>> * Assumes dead_tuples array is in sorted order. >>>> */ >>>> static bool >>>> lazy_tid_reaped(ItemPointer itemptr, void *state) >>>> { >>>> LVRelStats *vacrelstats = (LVRelStats *) state; >>>> >>>> You might want to update the comment of lazy_tid_reaped() as well. >>> >>> I don't see the mismatch with reality there (if you consider >>> "dead_tples array" in the proper context, that is, the multiarray). >>> >>> What in particular do you find out of sync there? >> >> The current lazy_tid_reaped just find a tid from a tid array using >> bsearch but in your patch lazy_tid_reaped handles multiple tid arrays >> and processing method become complicated. So I thought it's better to >> add the description of this function. > > Alright, updated with some more remarks that seemed relevant I just realized I never updated the early free patch after the multiarray version. So attached is a patch that frees the multiarray as early as possible (just after finishing with index bulk deletes, right before doing index cleanup and attempting truncation). This should make the possibly big amount of memory available to other processes for the duration of those tasks, which could be a long time in some cases.
Attachment
Hi, I've *not* read the history of this thread. So I really might be missing some context. > From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 > From: Claudio Freire <klaussfreire@gmail.com> > Date: Mon, 12 Sep 2016 23:36:42 -0300 > Subject: [PATCH] Vacuum: allow using more than 1GB work mem > > Turn the dead_tuples array into a structure composed of several > exponentially bigger arrays, to enable usage of more than 1GB > of work mem during vacuum and thus reduce the number of full > index scans necessary to remove all dead tids when the memory is > available. > * We are willing to use at most maintenance_work_mem (or perhaps > * autovacuum_work_mem) memory space to keep track of dead tuples. We > - * initially allocate an array of TIDs of that size, with an upper limit that > + * initially allocate an array of TIDs of 128MB, or an upper limit that > * depends on table size (this limit ensures we don't allocate a huge area > - * uselessly for vacuuming small tables). If the array threatens to overflow, > - * we suspend the heap scan phase and perform a pass of index cleanup and page > - * compaction, then resume the heap scan with an empty TID array. > + * uselessly for vacuuming small tables). Additional arrays of increasingly > + * large sizes are allocated as they become necessary. > + * > + * The TID array is thus represented as a list of multiple segments of > + * varying size, beginning with the initial size of up to 128MB, and growing > + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem > + * is used up. When the chunk size is 128MB, I'm a bit unconvinced that using exponential growth is worth it. The allocator overhead can't be meaningful in comparison to collecting 128MB dead tuples, the potential waste is pretty big, and it increases memory fragmentation. > + * Lookup in that structure proceeds sequentially in the list of segments, > + * and with a binary search within each segment. Since segment's size grows > + * exponentially, this retains O(N log N) lookup complexity. N log N is a horrible lookup complexity. That's the complexity of *sorting* an entire array. I think you might be trying to argue that it's log(N) * log(N)? Once log(n) for the exponentially growing size of segments, one for the binary search? Afaics you could quite easily make it O(2 log(N)) by simply also doing binary search over the segments. Might not be worth it due to the small constant involved normally. > + * If the array threatens to overflow, we suspend the heap scan phase and > + * perform a pass of index cleanup and page compaction, then resume the heap > + * scan with an array of logically empty but already preallocated TID segments > + * to be refilled with more dead tuple TIDs. Hm, it's not really the array that overflows, it's m_w_m that'd be exceeded, right? > /* > + * Minimum (starting) size of the dead_tuples array segments. Will allocate > + * space for 128MB worth of tid pointers in the first segment, further segments > + * will grow in size exponentially. Don't make it too small or the segment list > + * will grow bigger than the sweetspot for search efficiency on big vacuums. > + */ > +#define LAZY_MIN_TUPLES Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData)) That's not really the minimum, no? s/MIN/INIT/? 
> +typedef struct DeadTuplesSegment > +{ > + int num_dead_tuples; /* # of entries in the segment */ > + int max_dead_tuples; /* # of entries allocated in the segment */ > + ItemPointerData last_dead_tuple; /* Copy of the last dead tuple (unset > + * until the segment is fully > + * populated) */ > + unsigned short padding; > + ItemPointer dt_tids; /* Array of dead tuples */ > +} DeadTuplesSegment; Whenever padding is needed, it should have an explanatory comment. It's certainly not obvious to me wh it's neede here. > @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel, > ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples; > ivinfo.strategy = vac_strategy; > > + /* Finalize the current segment by setting its upper bound dead tuple */ > + seg = DeadTuplesCurrentSegment(vacrelstats); > + if (seg->num_dead_tuples > 0) > + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; Why don't we just maintain this here, for all of the segments? Seems a bit easier. > @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) > static void > lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) > { > - long maxtuples; > + long maxtuples, > + mintuples; > int vac_work_mem = IsAutoVacuumWorkerProcess() && > autovacuum_work_mem != -1 ? > autovacuum_work_mem : maintenance_work_mem; > @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) > { > maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); > maxtuples = Min(maxtuples, INT_MAX); > - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); > > /* curious coding here to ensure the multiplication can't overflow */ > if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks) > @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) > maxtuples = MaxHeapTuplesPerPage; > } > > - vacrelstats->num_dead_tuples = 0; > - vacrelstats->max_dead_tuples = (int) maxtuples; > - vacrelstats->dead_tuples = (ItemPointer) > - palloc(maxtuples * sizeof(ItemPointerData)); > + mintuples = Min(LAZY_MIN_TUPLES, maxtuples); > + > + vacrelstats->dead_tuples.num_entries = 0; > + vacrelstats->dead_tuples.max_entries = (int) maxtuples; > + vacrelstats->dead_tuples.num_segs = 1; > + vacrelstats->dead_tuples.last_seg = 0; > + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) > + palloc(sizeof(DeadTuplesSegment)); > + vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer) > + palloc(mintuples * sizeof(ItemPointerData)); > + vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples; > + vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0; > } Hm. Why don't we delay allocating dt_segments[0] till we actually need it? It's not uncommon for vacuums not to be able to find any dead tuples, and it'd not change code in lazy_record_dead_tuple() much. > @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats, > * could if we are given a really small maintenance_work_mem. In that > * case, just forget the last few tuples (we'll get 'em next time). 
> */ > - if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples) > + if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries) > { > - vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr; > - vacrelstats->num_dead_tuples++; > + DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats); > + > + if (seg->num_dead_tuples >= seg->max_dead_tuples) > + { > + /* > + * The segment is overflowing, so we must allocate a new segment. > + * We could have a preallocated segment descriptor already, in > + * which case we just reinitialize it, or we may need to repalloc > + * the vacrelstats->dead_tuples array. In that case, seg will no > + * longer be valid, so we must be careful about that. In any case, > + * we must update the last_dead_tuple copy in the overflowing > + * segment descriptor. > + */ > + Assert(seg->num_dead_tuples == seg->max_dead_tuples); > + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; > + if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs) > + { > + int new_num_segs = vacrelstats->dead_tuples.num_segs * 2; > + > + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc( > + (void *) vacrelstats->dead_tuples.dt_segments, > + new_num_segs * sizeof(DeadTuplesSegment)); Might be worth breaking this into some sub-statements, it's quite hard to read. > + while (vacrelstats->dead_tuples.num_segs < new_num_segs) > + { > + /* Initialize as "unallocated" */ > + DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[ > + vacrelstats->dead_tuples.num_segs]); dito. > +/* > * lazy_tid_reaped() -- is a particular tid deletable? > * > * This has the right signature to be an IndexBulkDeleteCallback. > * > - * Assumes dead_tuples array is in sorted order. > + * Assumes the dead_tuples multiarray is in sorted order, both > + * the segment list and each segment itself, and that all segments' > + * last_dead_tuple fields up to date > */ > static bool > lazy_tid_reaped(ItemPointer itemptr, void *state) Have you done performance evaluation about potential performance regressions in big indexes here? IIRC this can be quite frequently called? I think this is reasonably close to commit, but unfortunately not quite there yet. I.e. I personally won't polish this up & commit in the next couple hours, but if somebody else wants to take that on... Greetings, Andres Freund
On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > I've *not* read the history of this thread. So I really might be > missing some context. > > >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 >> From: Claudio Freire <klaussfreire@gmail.com> >> Date: Mon, 12 Sep 2016 23:36:42 -0300 >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem >> >> Turn the dead_tuples array into a structure composed of several >> exponentially bigger arrays, to enable usage of more than 1GB >> of work mem during vacuum and thus reduce the number of full >> index scans necessary to remove all dead tids when the memory is >> available. > >> * We are willing to use at most maintenance_work_mem (or perhaps >> * autovacuum_work_mem) memory space to keep track of dead tuples. We >> - * initially allocate an array of TIDs of that size, with an upper limit that >> + * initially allocate an array of TIDs of 128MB, or an upper limit that >> * depends on table size (this limit ensures we don't allocate a huge area >> - * uselessly for vacuuming small tables). If the array threatens to overflow, >> - * we suspend the heap scan phase and perform a pass of index cleanup and page >> - * compaction, then resume the heap scan with an empty TID array. >> + * uselessly for vacuuming small tables). Additional arrays of increasingly >> + * large sizes are allocated as they become necessary. >> + * >> + * The TID array is thus represented as a list of multiple segments of >> + * varying size, beginning with the initial size of up to 128MB, and growing >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem >> + * is used up. > > When the chunk size is 128MB, I'm a bit unconvinced that using > exponential growth is worth it. The allocator overhead can't be > meaningful in comparison to collecting 128MB dead tuples, the potential > waste is pretty big, and it increases memory fragmentation. The exponential strategy is mainly to improve lookup time (ie: to avoid large segment lists). >> + * Lookup in that structure proceeds sequentially in the list of segments, >> + * and with a binary search within each segment. Since segment's size grows >> + * exponentially, this retains O(N log N) lookup complexity. > > N log N is a horrible lookup complexity. That's the complexity of > *sorting* an entire array. I think you might be trying to argue that > it's log(N) * log(N)? Once log(n) for the exponentially growing size of > segments, one for the binary search? > > Afaics you could quite easily make it O(2 log(N)) by simply also doing > binary search over the segments. Might not be worth it due to the small > constant involved normally. It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N)) >> + * If the array threatens to overflow, we suspend the heap scan phase and >> + * perform a pass of index cleanup and page compaction, then resume the heap >> + * scan with an array of logically empty but already preallocated TID segments >> + * to be refilled with more dead tuple TIDs. > > Hm, it's not really the array that overflows, it's m_w_m that'd be > exceeded, right? Yes, will rephrase. Although that's how the original comment expressed the same concept. >> /* >> + * Minimum (starting) size of the dead_tuples array segments. Will allocate >> + * space for 128MB worth of tid pointers in the first segment, further segments >> + * will grow in size exponentially. 
Don't make it too small or the segment list >> + * will grow bigger than the sweetspot for search efficiency on big vacuums. >> + */ >> +#define LAZY_MIN_TUPLES Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData)) > > That's not really the minimum, no? s/MIN/INIT/? Ok >> +typedef struct DeadTuplesSegment >> +{ >> + int num_dead_tuples; /* # of entries in the segment */ >> + int max_dead_tuples; /* # of entries allocated in the segment */ >> + ItemPointerData last_dead_tuple; /* Copy of the last dead tuple (unset >> + * until the segment is fully >> + * populated) */ >> + unsigned short padding; >> + ItemPointer dt_tids; /* Array of dead tuples */ >> +} DeadTuplesSegment; > > Whenever padding is needed, it should have an explanatory comment. It's > certainly not obvious to me wh it's neede here. Ok >> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel, >> ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples; >> ivinfo.strategy = vac_strategy; >> >> + /* Finalize the current segment by setting its upper bound dead tuple */ >> + seg = DeadTuplesCurrentSegment(vacrelstats); >> + if (seg->num_dead_tuples > 0) >> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; > > Why don't we just maintain this here, for all of the segments? Seems a > bit easier. Originally, I just wanted to maintain the validity of last_dead_tuple as an invariant at all times. But it may be like you say, that it's simpler to just maintain the invariant of all segments at finalization time. I'll explore that possibility. >> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) >> static void >> lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) >> { >> - long maxtuples; >> + long maxtuples, >> + mintuples; >> int vac_work_mem = IsAutoVacuumWorkerProcess() && >> autovacuum_work_mem != -1 ? >> autovacuum_work_mem : maintenance_work_mem; >> @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) >> { >> maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); >> maxtuples = Min(maxtuples, INT_MAX); >> - maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); >> >> /* curious coding here to ensure the multiplication can't overflow */ >> if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks) >> @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) >> maxtuples = MaxHeapTuplesPerPage; >> } >> >> - vacrelstats->num_dead_tuples = 0; >> - vacrelstats->max_dead_tuples = (int) maxtuples; >> - vacrelstats->dead_tuples = (ItemPointer) >> - palloc(maxtuples * sizeof(ItemPointerData)); >> + mintuples = Min(LAZY_MIN_TUPLES, maxtuples); >> + >> + vacrelstats->dead_tuples.num_entries = 0; >> + vacrelstats->dead_tuples.max_entries = (int) maxtuples; >> + vacrelstats->dead_tuples.num_segs = 1; >> + vacrelstats->dead_tuples.last_seg = 0; >> + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) >> + palloc(sizeof(DeadTuplesSegment)); >> + vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer) >> + palloc(mintuples * sizeof(ItemPointerData)); >> + vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples; >> + vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0; >> } > > Hm. Why don't we delay allocating dt_segments[0] till we actually need > it? It's not uncommon for vacuums not to be able to find any dead > tuples, and it'd not change code in lazy_record_dead_tuple() much. 
I avoided that because that would make dt_segments[last_seg] invalid for the case of a just-initialized multiarray. Some places in the code use a macro that references dt_segments[last_seg] (mostly for indexless tables), and having to check num_segs and do lazy initialization would have complicated the code considerably. Nonetheless, I'll re-check how viable doing that would be. >> @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats, >> * could if we are given a really small maintenance_work_mem. In that >> * case, just forget the last few tuples (we'll get 'em next time). >> */ >> - if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples) >> + if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries) >> { >> - vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr; >> - vacrelstats->num_dead_tuples++; >> + DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats); >> + >> + if (seg->num_dead_tuples >= seg->max_dead_tuples) >> + { >> + /* >> + * The segment is overflowing, so we must allocate a new segment. >> + * We could have a preallocated segment descriptor already, in >> + * which case we just reinitialize it, or we may need to repalloc >> + * the vacrelstats->dead_tuples array. In that case, seg will no >> + * longer be valid, so we must be careful about that. In any case, >> + * we must update the last_dead_tuple copy in the overflowing >> + * segment descriptor. >> + */ >> + Assert(seg->num_dead_tuples == seg->max_dead_tuples); >> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; >> + if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs) >> + { >> + int new_num_segs = vacrelstats->dead_tuples.num_segs * 2; >> + >> + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc( >> + (void *) vacrelstats->dead_tuples.dt_segments, >> + new_num_segs * sizeof(DeadTuplesSegment)); > > Might be worth breaking this into some sub-statements, it's quite hard > to read. Breaking what precisely? The comment? >> + while (vacrelstats->dead_tuples.num_segs < new_num_segs) >> + { >> + /* Initialize as "unallocated" */ >> + DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[ >> + vacrelstats->dead_tuples.num_segs]); > > dito. I don't really get what you're asking here. >> +/* >> * lazy_tid_reaped() -- is a particular tid deletable? >> * >> * This has the right signature to be an IndexBulkDeleteCallback. >> * >> - * Assumes dead_tuples array is in sorted order. >> + * Assumes the dead_tuples multiarray is in sorted order, both >> + * the segment list and each segment itself, and that all segments' >> + * last_dead_tuple fields up to date >> */ >> static bool >> lazy_tid_reaped(ItemPointer itemptr, void *state) > > Have you done performance evaluation about potential performance > regressions in big indexes here? IIRC this can be quite frequently > called? Yes, the benchmarks are upthread. The earlier runs were run on my laptop and made little sense, so I'd ignore them as inaccurate. The latest run[1] with a pgbench scale of 4000 gave an improvement in CPU time (ie: faster) of about 20%. Anastasia did another one[2] and saw improvements as well, roughly 30%, though it's not measuring CPU time but rather elapsed time. Even small scales (100) saw an improvement as well, although possibly below the noise floor. Tests are very slow so I haven't run enough to measure variance and statistical significance. 
I blame the improvement not only on better cache locality (the initial search on the segment list usually fits on L1) but also on less overall work due to needing less index scans, and the fact that overall lookup complexity remains O(log N) due to the exponential segment growth strategy. [1] https://www.postgresql.org/message-id/CAGTBQpa6NFGO_6g_y_7zQx8L9GcHDSQKYdo1tGuh791z6PYgEg%40mail.gmail.com [2] https://www.postgresql.org/message-id/13bee467-bdcf-d3b9-c0ee-e2792fd46839%40postgrespro.ru > > > I think this is reasonably close to commit, but unfortunately not quite > there yet. I.e. I personally won't polish this up & commit in the next > couple hours, but if somebody else wants to take that on... > > Greetings, > > Andres Freund I'll post an updated patch with the requested changes shortly.
On Fri, Apr 7, 2017 at 7:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> + * Lookup in that structure proceeds sequentially in the list of segments, >>> + * and with a binary search within each segment. Since segment's size grows >>> + * exponentially, this retains O(N log N) lookup complexity. >> >> N log N is a horrible lookup complexity. That's the complexity of >> *sorting* an entire array. I think you might be trying to argue that >> it's log(N) * log(N)? Once log(n) for the exponentially growing size of >> segments, one for the binary search? >> >> Afaics you could quite easily make it O(2 log(N)) by simply also doing >> binary search over the segments. Might not be worth it due to the small >> constant involved normally. > > It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N)) To clarify, lookup over the segments is linear, so it's O(M) with M the number of segments, then the binary search is O(log N) with N the number of dead tuples. So lookup is O(M + log N), but M < log N because of the segment's exponential growth, therefore the lookup is O(2 log N)
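To make that lookup shape concrete, here is a simplified, self-contained sketch of the idea, using made-up types and names rather than the patch's actual structures: a linear pass over the short segment list using each segment's stored upper bound, then a binary search inside the matching segment, which is the O(M + log N) behavior described above.

#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-ins for the patch's per-segment bookkeeping */
typedef struct
{
    long    key;            /* stand-in for an ItemPointer */
} Tid;

typedef struct
{
    int     ntuples;        /* entries stored in this segment */
    Tid     last;           /* copy of the largest entry in the segment */
    Tid    *tids;           /* sorted array of entries */
} Segment;

static int
tid_cmp(const void *a, const void *b)
{
    long    ka = ((const Tid *) a)->key;
    long    kb = ((const Tid *) b)->key;

    return (ka > kb) - (ka < kb);
}

/*
 * Linear scan over the (short) segment list, then binary search inside the
 * first segment whose upper bound covers the probe: O(M + log N) overall.
 */
static bool
tid_is_dead(const Tid *probe, const Segment *segs, int nsegs)
{
    for (int i = 0; i < nsegs; i++)
    {
        if (tid_cmp(probe, &segs[i].last) <= 0)
            return bsearch(probe, segs[i].tids, segs[i].ntuples,
                           sizeof(Tid), tid_cmp) != NULL;
    }
    return false;           /* probe is larger than anything recorded */
}

int
main(void)
{
    Tid     a[] = {{1}, {5}, {9}};
    Tid     b[] = {{12}, {30}};
    Segment segs[] = {{3, {9}, a}, {2, {30}, b}};
    Tid     probe = {12};

    return tid_is_dead(&probe, segs, 2) ? 0 : 1;
}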
Hi, On 2017-04-07 19:43:39 -0300, Claudio Freire wrote: > On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > > I've *not* read the history of this thread. So I really might be > > missing some context. > > > > > >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 > >> From: Claudio Freire <klaussfreire@gmail.com> > >> Date: Mon, 12 Sep 2016 23:36:42 -0300 > >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem > >> > >> Turn the dead_tuples array into a structure composed of several > >> exponentially bigger arrays, to enable usage of more than 1GB > >> of work mem during vacuum and thus reduce the number of full > >> index scans necessary to remove all dead tids when the memory is > >> available. > > > >> * We are willing to use at most maintenance_work_mem (or perhaps > >> * autovacuum_work_mem) memory space to keep track of dead tuples. We > >> - * initially allocate an array of TIDs of that size, with an upper limit that > >> + * initially allocate an array of TIDs of 128MB, or an upper limit that > >> * depends on table size (this limit ensures we don't allocate a huge area > >> - * uselessly for vacuuming small tables). If the array threatens to overflow, > >> - * we suspend the heap scan phase and perform a pass of index cleanup and page > >> - * compaction, then resume the heap scan with an empty TID array. > >> + * uselessly for vacuuming small tables). Additional arrays of increasingly > >> + * large sizes are allocated as they become necessary. > >> + * > >> + * The TID array is thus represented as a list of multiple segments of > >> + * varying size, beginning with the initial size of up to 128MB, and growing > >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem > >> + * is used up. > > > > When the chunk size is 128MB, I'm a bit unconvinced that using > > exponential growth is worth it. The allocator overhead can't be > > meaningful in comparison to collecting 128MB dead tuples, the potential > > waste is pretty big, and it increases memory fragmentation. > > The exponential strategy is mainly to improve lookup time (ie: to > avoid large segment lists). Well, if we were to do binary search on the segment list, that'd not be necessary. > >> + if (seg->num_dead_tuples >= seg->max_dead_tuples) > >> + { > >> + /* > >> + * The segment is overflowing, so we must allocate a new segment. > >> + * We could have a preallocated segment descriptor already, in > >> + * which case we just reinitialize it, or we may need to repalloc > >> + * the vacrelstats->dead_tuples array. In that case, seg will no > >> + * longer be valid, so we must be careful about that. In any case, > >> + * we must update the last_dead_tuple copy in the overflowing > >> + * segment descriptor. > >> + */ > >> + Assert(seg->num_dead_tuples == seg->max_dead_tuples); > >> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; > >> + if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs) > >> + { > >> + int new_num_segs = vacrelstats->dead_tuples.num_segs * 2; > >> + > >> + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc( > >> + (void *) vacrelstats->dead_tuples.dt_segments, > >> + new_num_segs * sizeof(DeadTuplesSegment)); > > > > Might be worth breaking this into some sub-statements, it's quite hard > > to read. > > Breaking what precisely? The comment? No, the three-line statement computing the new value of dead_tuples.dt_segments. 
I'd at least assign dead_tuples to a local variable, to cut the length of the statement down. > >> + while (vacrelstats->dead_tuples.num_segs < new_num_segs) > >> + { > >> + /* Initialize as "unallocated" */ > >> + DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[ > >> + vacrelstats->dead_tuples.num_segs]); > > > > dito. > > I don't really get what you're asking here. Trying to simplify/shorten the statement. > >> +/* > >> * lazy_tid_reaped() -- is a particular tid deletable? > >> * > >> * This has the right signature to be an IndexBulkDeleteCallback. > >> * > >> - * Assumes dead_tuples array is in sorted order. > >> + * Assumes the dead_tuples multiarray is in sorted order, both > >> + * the segment list and each segment itself, and that all segments' > >> + * last_dead_tuple fields up to date > >> */ > >> static bool > >> lazy_tid_reaped(ItemPointer itemptr, void *state) > > > > Have you done performance evaluation about potential performance > > regressions in big indexes here? IIRC this can be quite frequently > > called? > > Yes, the benchmarks are upthread. The earlier runs were run on my > laptop and made little sense, so I'd ignore them as inaccurate. The > latest run[1] with a pgbench scale of 4000 gave an improvement in CPU > time (ie: faster) of about 20%. Anastasia did another one[2] and saw > improvements as well, roughly 30%, though it's not measuring CPU time > but rather elapsed time. I'd be more concerned about cases that'd already fit into memory, not ones where we avoid doing another scan - and I think you mostly measured that? - Andres
On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote: >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hi, >> > >> > I've *not* read the history of this thread. So I really might be >> > missing some context. >> > >> > >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 >> >> From: Claudio Freire <klaussfreire@gmail.com> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300 >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem >> >> >> >> Turn the dead_tuples array into a structure composed of several >> >> exponentially bigger arrays, to enable usage of more than 1GB >> >> of work mem during vacuum and thus reduce the number of full >> >> index scans necessary to remove all dead tids when the memory is >> >> available. >> > >> >> * We are willing to use at most maintenance_work_mem (or perhaps >> >> * autovacuum_work_mem) memory space to keep track of dead tuples. We >> >> - * initially allocate an array of TIDs of that size, with an upper limit that >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that >> >> * depends on table size (this limit ensures we don't allocate a huge area >> >> - * uselessly for vacuuming small tables). If the array threatens to overflow, >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page >> >> - * compaction, then resume the heap scan with an empty TID array. >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly >> >> + * large sizes are allocated as they become necessary. >> >> + * >> >> + * The TID array is thus represented as a list of multiple segments of >> >> + * varying size, beginning with the initial size of up to 128MB, and growing >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem >> >> + * is used up. >> > >> > When the chunk size is 128MB, I'm a bit unconvinced that using >> > exponential growth is worth it. The allocator overhead can't be >> > meaningful in comparison to collecting 128MB dead tuples, the potential >> > waste is pretty big, and it increases memory fragmentation. >> >> The exponential strategy is mainly to improve lookup time (ie: to >> avoid large segment lists). > > Well, if we were to do binary search on the segment list, that'd not be > necessary. True, but the initial lookup might be slower in the end, since the array would be bigger and cache locality worse. Why do you say exponential growth fragments memory? AFAIK, all those allocations are well beyond the point where malloc starts mmaping memory, so each of those segments should be a mmap segment, independently freeable. >> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples) >> >> + { >> >> + /* >> >> + * The segment is overflowing, so we must allocate a new segment. >> >> + * We could have a preallocated segment descriptor already, in >> >> + * which case we just reinitialize it, or we may need to repalloc >> >> + * the vacrelstats->dead_tuples array. In that case, seg will no >> >> + * longer be valid, so we must be careful about that. In any case, >> >> + * we must update the last_dead_tuple copy in the overflowing >> >> + * segment descriptor. 
>> >> + */ >> >> + Assert(seg->num_dead_tuples == seg->max_dead_tuples); >> >> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; >> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs) >> >> + { >> >> + int new_num_segs = vacrelstats->dead_tuples.num_segs * 2; >> >> + >> >> + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc( >> >> + (void *) vacrelstats->dead_tuples.dt_segments, >> >> + new_num_segs * sizeof(DeadTuplesSegment)); >> > >> > Might be worth breaking this into some sub-statements, it's quite hard >> > to read. >> >> Breaking what precisely? The comment? > > No, the three-line statement computing the new value of > dead_tuples.dt_segments. I'd at least assign dead_tuples to a local > variable, to cut the length of the statement down. Ah, alright. Will try to do that. >> >> +/* >> >> * lazy_tid_reaped() -- is a particular tid deletable? >> >> * >> >> * This has the right signature to be an IndexBulkDeleteCallback. >> >> * >> >> - * Assumes dead_tuples array is in sorted order. >> >> + * Assumes the dead_tuples multiarray is in sorted order, both >> >> + * the segment list and each segment itself, and that all segments' >> >> + * last_dead_tuple fields up to date >> >> */ >> >> static bool >> >> lazy_tid_reaped(ItemPointer itemptr, void *state) >> > >> > Have you done performance evaluation about potential performance >> > regressions in big indexes here? IIRC this can be quite frequently >> > called? >> >> Yes, the benchmarks are upthread. The earlier runs were run on my >> laptop and made little sense, so I'd ignore them as inaccurate. The >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw >> improvements as well, roughly 30%, though it's not measuring CPU time >> but rather elapsed time. > > I'd be more concerned about cases that'd already fit into memory, not ones > where we avoid doing another scan - and I think you mostly measured that? > > - Andres Well, scale 400 is pretty much as big as you can get with the old 1GB limit, and also suffered no significant regression. Although, true, id didn't significantly improve either.
On 2017-04-07 22:06:13 -0300, Claudio Freire wrote: > On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > > > > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote: > >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Hi, > >> > > >> > I've *not* read the history of this thread. So I really might be > >> > missing some context. > >> > > >> > > >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 > >> >> From: Claudio Freire <klaussfreire@gmail.com> > >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300 > >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem > >> >> > >> >> Turn the dead_tuples array into a structure composed of several > >> >> exponentially bigger arrays, to enable usage of more than 1GB > >> >> of work mem during vacuum and thus reduce the number of full > >> >> index scans necessary to remove all dead tids when the memory is > >> >> available. > >> > > >> >> * We are willing to use at most maintenance_work_mem (or perhaps > >> >> * autovacuum_work_mem) memory space to keep track of dead tuples. We > >> >> - * initially allocate an array of TIDs of that size, with an upper limit that > >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that > >> >> * depends on table size (this limit ensures we don't allocate a huge area > >> >> - * uselessly for vacuuming small tables). If the array threatens to overflow, > >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page > >> >> - * compaction, then resume the heap scan with an empty TID array. > >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly > >> >> + * large sizes are allocated as they become necessary. > >> >> + * > >> >> + * The TID array is thus represented as a list of multiple segments of > >> >> + * varying size, beginning with the initial size of up to 128MB, and growing > >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem > >> >> + * is used up. > >> > > >> > When the chunk size is 128MB, I'm a bit unconvinced that using > >> > exponential growth is worth it. The allocator overhead can't be > >> > meaningful in comparison to collecting 128MB dead tuples, the potential > >> > waste is pretty big, and it increases memory fragmentation. > >> > >> The exponential strategy is mainly to improve lookup time (ie: to > >> avoid large segment lists). > > > > Well, if we were to do binary search on the segment list, that'd not be > > necessary. > > True, but the initial lookup might be slower in the end, since the > array would be bigger and cache locality worse. > > Why do you say exponential growth fragments memory? AFAIK, all those > allocations are well beyond the point where malloc starts mmaping > memory, so each of those segments should be a mmap segment, > independently freeable. Not all platforms have that, and even on platforms with it, frequent, unevenly sized, very large allocations can lead to enough fragmentation that further allocations are harder and fragment / enlarge the pagetable. > >> Yes, the benchmarks are upthread. The earlier runs were run on my > >> laptop and made little sense, so I'd ignore them as inaccurate. The > >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU > >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw > >> improvements as well, roughly 30%, though it's not measuring CPU time > >> but rather elapsed time. 
> > > > I'd be more concerned about cases that'd already fit into memory, not ones > > where we avoid doing another scan - and I think you mostly measured that? > > > > - Andres > > Well, scale 400 is pretty much as big as you can get with the old 1GB > limit, and also suffered no significant regression. Although, true, id > didn't significantly improve either. Aren't more interesting cases those where not that many dead tuples are found, but the indexes are pretty large? IIRC the index vacuum scans still visit every leaf index tuple, no? Greetings, Andres Freund
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote: > On 2017-04-07 22:06:13 -0300, Claudio Freire wrote: >> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote: >> > Hi, >> > >> > >> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote: >> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote: >> >> > Hi, >> >> > >> >> > I've *not* read the history of this thread. So I really might be >> >> > missing some context. >> >> > >> >> > >> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001 >> >> >> From: Claudio Freire <klaussfreire@gmail.com> >> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300 >> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem >> >> >> >> >> >> Turn the dead_tuples array into a structure composed of several >> >> >> exponentially bigger arrays, to enable usage of more than 1GB >> >> >> of work mem during vacuum and thus reduce the number of full >> >> >> index scans necessary to remove all dead tids when the memory is >> >> >> available. >> >> > >> >> >> * We are willing to use at most maintenance_work_mem (or perhaps >> >> >> * autovacuum_work_mem) memory space to keep track of dead tuples. We >> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that >> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that >> >> >> * depends on table size (this limit ensures we don't allocate a huge area >> >> >> - * uselessly for vacuuming small tables). If the array threatens to overflow, >> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page >> >> >> - * compaction, then resume the heap scan with an empty TID array. >> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly >> >> >> + * large sizes are allocated as they become necessary. >> >> >> + * >> >> >> + * The TID array is thus represented as a list of multiple segments of >> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing >> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem >> >> >> + * is used up. >> >> > >> >> > When the chunk size is 128MB, I'm a bit unconvinced that using >> >> > exponential growth is worth it. The allocator overhead can't be >> >> > meaningful in comparison to collecting 128MB dead tuples, the potential >> >> > waste is pretty big, and it increases memory fragmentation. >> >> >> >> The exponential strategy is mainly to improve lookup time (ie: to >> >> avoid large segment lists). >> > >> > Well, if we were to do binary search on the segment list, that'd not be >> > necessary. >> >> True, but the initial lookup might be slower in the end, since the >> array would be bigger and cache locality worse. >> >> Why do you say exponential growth fragments memory? AFAIK, all those >> allocations are well beyond the point where malloc starts mmaping >> memory, so each of those segments should be a mmap segment, >> independently freeable. > > Not all platforms have that, and even on platforms with it, frequent, > unevenly sized, very large allocations can lead to enough fragmentation > that further allocations are harder and fragment / enlarge the > pagetable. I wouldn't call this frequent. You can get at most slightly more than a dozen such allocations given the current limits. And allocation sizes are quite regular - you get 128M or multiples of 128M, so each free block can be reused for N smaller allocations if needed. 
I don't think it has much potential to fragment memory. This isn't significantly different from tuplesort or any other code that can do big allocations, and the differences favor less fragmentation than those, so I don't see why this would need special treatment. My point being that it hasn't been simple to get to a point where this beats the original single binary search, even in CPU time. If we're to scrap this implementation and go for a double binary search, I'd like to have a clear measurable benefit to chase from doing so. Fragmentation is hard to measure, and I cannot get CPU-bound vacuums on the test hardware I have to test lookup performance at big scales. >> >> Yes, the benchmarks are upthread. The earlier runs were run on my >> >> laptop and made little sense, so I'd ignore them as inaccurate. The >> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU >> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw >> >> improvements as well, roughly 30%, though it's not measuring CPU time >> >> but rather elapsed time. >> > >> > I'd be more concerned about cases that'd already fit into memory, not ones >> > where we avoid doing another scan - and I think you mostly measured that? >> > >> > - Andres >> >> Well, scale 400 is pretty much as big as you can get with the old 1GB >> limit, and also suffered no significant regression. Although, true, id >> didn't significantly improve either. > > Aren't more interesting cases those where not that many dead tuples are > found, but the indexes are pretty large? IIRC the index vacuum scans > still visit every leaf index tuple, no? Indeed they do, and that's what motivated this patch. But I'd need TB-sized tables to set up something like that. I don't have the hardware or time available to do that (vacuum on bloated TB-sized tables can take days in my experience). Scale 4000 is as big as I can get without running out of space for the tests in my test hardware. If anybody else has the ability, I'd be thankful if they did test it under those conditions, but I cannot. I think Anastasia's test is closer to such a test, which is probably why it shows a bigger improvement in total elapsed time. Our production database could possibly be used, but it can take about a week to clone it, upgrade it (it's 9.5 currently), and run the relevant vacuum. I did perform tests against the same pgbench databases referenced in the post I linked earlier, but deleting only a fraction of the rows, or on uncorrelated indexes. The benchmarks weren't very interesting, and results were consistent with the linked benchmark (slight CPU time improvement, just less impactful), so I didn't post them. I think all those tests show that, if there's a workload that regresses, it's a rare one, running on very powerful I/O hardware (to make vacuum CPU-bound). And even if that were to happen, considering that a single (or fewer) index scan, even if slower, causes less WAL traffic that has to be archived/streamed, it would still most likely be a win overall.
On Fri, Apr 7, 2017 at 10:06 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples) >>> >> + { >>> >> + /* >>> >> + * The segment is overflowing, so we must allocate a new segment. >>> >> + * We could have a preallocated segment descriptor already, in >>> >> + * which case we just reinitialize it, or we may need to repalloc >>> >> + * the vacrelstats->dead_tuples array. In that case, seg will no >>> >> + * longer be valid, so we must be careful about that. In any case, >>> >> + * we must update the last_dead_tuple copy in the overflowing >>> >> + * segment descriptor. >>> >> + */ >>> >> + Assert(seg->num_dead_tuples == seg->max_dead_tuples); >>> >> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1]; >>> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs) >>> >> + { >>> >> + int new_num_segs = vacrelstats->dead_tuples.num_segs * 2; >>> >> + >>> >> + vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc( >>> >> + (void *) vacrelstats->dead_tuples.dt_segments, >>> >> + new_num_segs * sizeof(DeadTuplesSegment)); >>> > >>> > Might be worth breaking this into some sub-statements, it's quite hard >>> > to read. >>> >>> Breaking what precisely? The comment? >> >> No, the three-line statement computing the new value of >> dead_tuples.dt_segments. I'd at least assign dead_tuples to a local >> variable, to cut the length of the statement down. > > Ah, alright. Will try to do that. Attached is an updated patch set with the requested changes. Segment allocation still follows the exponential strategy, and segment lookup is still linear. I rebased the early free patch (patch 3) to apply on top of the v9 patch 2 (it needed some changes). I recognize the early free patch didn't get nearly as much scrutiny, so I'm fine with commiting only 2 if that one's ready to go but 3 isn't. If it's decided to go for fixed 128M segments and a binary search of segments, I don't think I can get that ready and tested before the commitfest ends. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 4/7/17 10:19 PM, Claudio Freire wrote: > > I rebased the early free patch (patch 3) to apply on top of the v9 > patch 2 (it needed some changes). I recognize the early free patch > didn't get nearly as much scrutiny, so I'm fine with commiting only 2 > if that one's ready to go but 3 isn't. > > If it's decided to go for fixed 128M segments and a binary search of > segments, I don't think I can get that ready and tested before the > commitfest ends. This submission has been moved to CF 2017-07. -- -David david@pgmasters.net
On Fri, Apr 7, 2017 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote: >> Why do you say exponential growth fragments memory? AFAIK, all those >> allocations are well beyond the point where malloc starts mmaping >> memory, so each of those segments should be a mmap segment, >> independently freeable. > > Not all platforms have that, and even on platforms with it, frequent, > unevenly sized, very large allocations can lead to enough fragmentation > that further allocations are harder and fragment / enlarge the > pagetable. Such a thing is completely outside my personal experience. I've never heard of a case where a 64-bit platform fails to allocate memory because something (what?) is fragmented. Page table memory usage is a concern at some level, but probably less so for autovacuum workers than for most backends, because autovacuum workers (where most vacuuming is done) exit after one pass through pg_class. Although I think our memory footprint is a topic that could use more energy, I don't really see any reason to think that pagetable bloat caused by unevenly sized allocations in short-lived processes is the place to start worrying. That having been said, IIRC, I did propose quite a ways upthread that we use a fixed chunk size, just because it would use less actual memory, never mind the size of the page table. I mean, if you allocate in chunks of 64MB, which I think is what I proposed, you'll never waste more than 64MB. If you allocate in exponentially-increasing chunk sizes starting at 128MB, you could easily waste much more. Let's imagine a 1TB table where 20% of the tuples are dead due to some large bulk operation (a bulk load failed, or a bulk delete succeeded, or a bulk update happened). Back of the envelope calculation: 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of maintenance_work_mem So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite enough so we'll allocate another 8GB, for a total of 16256MB, but more than three-quarters of that last allocation ends up being wasted. I've been told on this list before that doubling is the one true way of increasing the size of an allocated chunk of memory, but I'm still a bit unconvinced. On the other hand, if we did allocate fixed chunks of, say, 64MB, we could end up with an awful lot of them. For example, in the example above, 9216MB/64MB = 144 chunks. Is that number of mappings going to make the VM system unhappy on any of the platforms we care about? Is that a bigger or smaller problem than what you (Andres) are worrying about? I don't know. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
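For anyone following the arithmetic, a small sketch that reproduces the numbers in that scenario under Robert's stated assumption of uncapped doubling from 128MB (the replies below point out that the patch actually caps each segment at 1GB, which changes the conclusion):

#include <stdio.h>

int
main(void)
{
    /* 1TB table, 8kB pages, 60 tuples/page, 20% dead, 6 bytes per TID */
    double  table_bytes = 1024.0 * 1024 * 1024 * 1024;
    double  dead_tuples = table_bytes / 8192 * 60 * 0.20;
    double  needed_mb = dead_tuples * 6 / (1024 * 1024);

    /* Uncapped doubling from 128MB until the requirement is covered */
    double  alloc_mb = 0;
    double  seg_mb = 128;

    while (alloc_mb < needed_mb)
    {
        alloc_mb += seg_mb;
        seg_mb *= 2;
    }

    /* Prints: needed 9216 MB, allocated 16256 MB */
    printf("needed %.0f MB, allocated %.0f MB\n", needed_mb, alloc_mb);
    return 0;
}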
On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of > maintenance_work_mem > > So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite > enough so we'll allocate another 8GB, for a total of 16256MB, but more > than three-quarters of that last allocation ends up being wasted. > I've been told on this list before that doubling is the one true way > of increasing the size of an allocated chunk of memory, but I'm still > a bit unconvinced. There you're wrong. The allocation is capped to 1GB, so wastage has an upper bound of 1GB.
On Tue, Apr 11, 2017 at 3:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of >> maintenance_work_mem >> >> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite >> enough so we'll allocate another 8GB, for a total of 16256MB, but more >> than three-quarters of that last allocation ends up being wasted. >> I've been told on this list before that doubling is the one true way >> of increasing the size of an allocated chunk of memory, but I'm still >> a bit unconvinced. > > There you're wrong. The allocation is capped to 1GB, so wastage has an > upper bound of 1GB. And total m_w_m for vacuum is still capped to 12GB (as big you can get with 32-bit integer indices). So you can get at most 15 segments (a binary search is thus not worth it), and overallocate by at most 1GB (the maximum segment size). At least that's my rationale. Removing the 12GB limit requires a bit of care (there are some 32-bit counters still around I believe).
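A quick sketch of how those bounds fall out under the growth scheme described upthread (start at 128MB, double each time, cap each segment at 1GB, overall budget capped at 12GB); counting segments this way, with the assumption that the last segment is trimmed to whatever budget remains, gives the "at most 15 segments" figure, with per-segment waste bounded by the 1GB cap:

#include <stdio.h>

int
main(void)
{
    int     budget_mb = 12 * 1024;  /* overall cap on vacuum's dead-tuple memory */
    int     seg_mb = 128;           /* initial segment size */
    int     total_mb = 0;
    int     nsegs = 0;

    while (total_mb < budget_mb)
    {
        int     this_mb = seg_mb;

        if (this_mb > budget_mb - total_mb)
            this_mb = budget_mb - total_mb;     /* last segment trimmed to the budget */
        total_mb += this_mb;
        nsegs++;
        if (seg_mb < 1024)          /* exponential growth, capped at 1GB per segment */
            seg_mb *= 2;
    }

    /* Prints: 15 segments cover 12288 MB */
    printf("%d segments cover %d MB\n", nsegs, total_mb);
    return 0;
}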
On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of >> maintenance_work_mem >> >> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite >> enough so we'll allocate another 8GB, for a total of 16256MB, but more >> than three-quarters of that last allocation ends up being wasted. >> I've been told on this list before that doubling is the one true way >> of increasing the size of an allocated chunk of memory, but I'm still >> a bit unconvinced. > > There you're wrong. The allocation is capped to 1GB, so wastage has an > upper bound of 1GB. Ah, OK. Sorry, didn't really look at the code. I stand corrected, but then it seems a bit strange to me that the largest and smallest allocations are only 8x different. I still don't really understand what that buys us. What would we lose if we just made 'em all 128MB? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of >>> maintenance_work_mem >>> >>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite >>> enough so we'll allocate another 8GB, for a total of 16256MB, but more >>> than three-quarters of that last allocation ends up being wasted. >>> I've been told on this list before that doubling is the one true way >>> of increasing the size of an allocated chunk of memory, but I'm still >>> a bit unconvinced. >> >> There you're wrong. The allocation is capped to 1GB, so wastage has an >> upper bound of 1GB. > > Ah, OK. Sorry, didn't really look at the code. I stand corrected, > but then it seems a bit strange to me that the largest and smallest > allocations are only 8x different. I still don't really understand > what that buys us. Basically, attacking the problem (that, I think, you mentioned) of very small systems in which overallocation for small vacuums was an issue. The "slow start" behavior of starting with smaller segments tries to improve the situation for small vacuums, not big ones. By starting at 128M and growing up to 1GB, overallocation is bound to the range 128M-1GB and is proportional to the amount of dead tuples, not table size, as it was before. Starting at 128M helps the initial segment search, but I could readily go for starting at 64M, I don't think it would make a huge difference. Removing exponential growth, however, would. As the patch stands, small systems (say 32-bit systems) without overcommit and with slowly-changing data can now set high m_w_m without running into overallocation issues with autovacuum reserving too much virtual space, as it will reserve memory only proportional to the amount of dead tuples. Previously, it would reserve all of m_w_m regardless of whether it was needed or not, with the only exception being really small tables, so m_w_m=1GB was unworkable in those cases. Now it should be fine. > What would we lose if we just made 'em all 128MB? TBH, not that much. We'd need 8x compares to find the segment, that forces a switch to binary search of the segments, which is less cache-friendly. So it's more complex code, less cache locality. I'm just not sure what's the benefit given current limits. The only aim of this multiarray approach was making *virtual address space reservations* proportional to the amount of actual memory needed, as opposed to configured limits. It doesn't need to be a tight fit, because calling palloc on its own doesn't actually use that memory, at least on big allocations like these - the OS will not map the memory pages until they're first touched. That's true in most modern systems, and many ancient ones too. In essence, the patch as it is proposed, doesn't *need* a binary search, because the segment list can only grow up to 15 segments at its biggest, and that's a size small enough that linear search will outperform (or at least perform as well as) binary search. Reducing the initial segment size wouldn't change that. If the 12GB limit is lifted, or the maximum segment size reduced (from 1GB to 128MB for example), however, that would change. I'd be more in favor of lifting the 12GB limit than of reducing the maximum segment size, for the reasons above. 
Raising the 12GB limit has concrete and readily apparent benefits, whereas using bigger (or smaller) segments is far more debatable. Yes, that will need a binary search. But, I was hoping that could be a second (or third) patch, to keep things simple, and benefits measurable. Also, the plan as discussed in this very long thread, was to eventually try to turn segments into bitmaps if dead tuple density was big enough. That benefits considerably from big segments, since lookup on a bitmap is O(1) - the bigger the segments, the faster the lookup, as the search on the segment list would be dominant. So... what shall we do? At this point, I've given all my arguments for the current design. If the more senior developers don't agree, I'll be happy to try your way.
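For readers following along, here is a simplified, standalone sketch of the kind of multiarray being described: a list of segments that starts small and grows exponentially up to a fixed per-segment cap, so the reserved memory tracks the number of dead tuples rather than the configured limit. The names (TidSketch, SegmentSketch, and so on) and the tiny sizes are illustrative stand-ins, not the patch's actual code, and error handling is omitted:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int   blkno;
    unsigned short offset;
} TidSketch;

typedef struct
{
    TidSketch *tids;   /* dead TIDs stored in this segment */
    int        num;    /* entries used */
    int        max;    /* entries allocated */
} SegmentSketch;

typedef struct
{
    SegmentSketch *segs;
    int            nsegs;
    size_t         next_size;  /* size (in entries) of the next segment */
    size_t         max_size;   /* per-segment cap (stands in for the 1GB cap) */
} DeadTidStoreSketch;

static void
store_add(DeadTidStoreSketch *st, unsigned int blkno, unsigned short off)
{
    SegmentSketch *seg;

    /* Allocate a new segment only when the previous one fills up, so the
     * reserved memory is proportional to the dead tuples actually seen. */
    if (st->nsegs == 0 ||
        st->segs[st->nsegs - 1].num == st->segs[st->nsegs - 1].max)
    {
        st->segs = realloc(st->segs, (st->nsegs + 1) * sizeof(SegmentSketch));
        seg = &st->segs[st->nsegs++];
        seg->tids = malloc(st->next_size * sizeof(TidSketch));
        seg->num = 0;
        seg->max = (int) st->next_size;

        /* "Slow start": exponential growth, clamped at the per-segment cap. */
        st->next_size = st->next_size * 2 > st->max_size ? st->max_size
                                                         : st->next_size * 2;
    }
    seg = &st->segs[st->nsegs - 1];
    seg->tids[seg->num].blkno = blkno;
    seg->tids[seg->num].offset = off;
    seg->num++;
}

int main(void)
{
    DeadTidStoreSketch st = {NULL, 0, 4, 16};   /* tiny sizes, just for the demo */
    unsigned int i;

    for (i = 0; i < 100; i++)
        store_add(&st, i / 10, (unsigned short) (i % 10 + 1));

    printf("%d segments hold 100 dead TIDs (sizes 4, 8, 16, 16, ...)\n", st.nsegs);
    return 0;
}

With the sizes discussed here (first segment 128MB, cap 1GB), the same logic is what keeps the reservation proportional to the number of dead tuples rather than to the configured maintenance_work_mem.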
On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > In essence, the patch as it is proposed, doesn't *need* a binary > search, because the segment list can only grow up to 15 segments at > its biggest, and that's a size small enough that linear search will > outperform (or at least perform as well as) binary search. Reducing > the initial segment size wouldn't change that. If the 12GB limit is > lifted, or the maximum segment size reduced (from 1GB to 128MB for > example), however, that would change. > > I'd be more in favor of lifting the 12GB limit than of reducing the > maximum segment size, for the reasons above. Raising the 12GB limit > has concrete and readily apparent benefits, whereas using bigger (or > smaller) segments is far more debatable. Yes, that will need a binary > search. But, I was hoping that could be a second (or third) patch, to > keep things simple, and benefits measurable. To me, it seems a bit short-sighted to say, OK, let's use a linear search because there's this 12GB limit so we can limit ourselves to 15 segments. Because somebody will want to remove that 12GB limit, and then we'll have to revisit the whole thing anyway. I think, anyway. What's not clear to me is how sensitive the performance of vacuum is to the number of cycles used here. For a large index, the number of searches will presumably be quite large, so it does seem worth worrying about performance. But if we just always used a binary search, would that lose enough performance with small numbers of segments that anyone would care? If so, maybe we need to use linear search for small numbers of segments and switch to binary search with larger numbers of segments. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> In essence, the patch as it is proposed, doesn't *need* a binary >> search, because the segment list can only grow up to 15 segments at >> its biggest, and that's a size small enough that linear search will >> outperform (or at least perform as well as) binary search. Reducing >> the initial segment size wouldn't change that. If the 12GB limit is >> lifted, or the maximum segment size reduced (from 1GB to 128MB for >> example), however, that would change. >> >> I'd be more in favor of lifting the 12GB limit than of reducing the >> maximum segment size, for the reasons above. Raising the 12GB limit >> has concrete and readily apparent benefits, whereas using bigger (or >> smaller) segments is far more debatable. Yes, that will need a binary >> search. But, I was hoping that could be a second (or third) patch, to >> keep things simple, and benefits measurable. > > To me, it seems a bit short-sighted to say, OK, let's use a linear > search because there's this 12GB limit so we can limit ourselves to 15 > segments. Because somebody will want to remove that 12GB limit, and > then we'll have to revisit the whole thing anyway. I think, anyway. Ok, attached an updated patch that implements the binary search > What's not clear to me is how sensitive the performance of vacuum is > to the number of cycles used here. For a large index, the number of > searches will presumably be quite large, so it does seem worth > worrying about performance. But if we just always used a binary > search, would that lose enough performance with small numbers of > segments that anyone would care? If so, maybe we need to use linear > search for small numbers of segments and switch to binary search with > larger numbers of segments. I just went and tested. I implemented the hybrid binary search attached, and ran a few tests with and without the sequential code enabled, at small scales. The difference is statistically significant, but small (less than 3%). With proper optimization of the binary search, however, the difference flips: claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 fullbinary.s100.times vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s, elapsed: 18.34 s. vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s, elapsed: 19.75 s. vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s, elapsed: 18.48 s. vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s, elapsed: 20.60 s. vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s, elapsed: 19.16 s. claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 hybridbinary.s100.times vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s, elapsed: 19.15 s. vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s, elapsed: 18.40 s. vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s, elapsed: 18.87 s. vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s, elapsed: 26.43 s. vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s, elapsed: 20.02 s. That's after inlining the compare on both the linear and sequential code, and it seems it lets the compiler optimize the binary search to the point where it outperforms the sequential search. That's not the case when the compare isn't inlined. That seems in line with [1], that show the impact of various optimizations on both algorithms. 
It's clearly a close enough race that optimizations play a huge role. Since we're not likely to go and implement SSE2-optimized versions, I believe I'll leave the binary search only. That's the attached patch set. I'm running the full test suite, but that takes a very long while. I'll post the results when they're done. [1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Apr 20, 2017 at 5:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> What's not clear to me is how sensitive the performance of vacuum is >> to the number of cycles used here. For a large index, the number of >> searches will presumably be quite large, so it does seem worth >> worrying about performance. But if we just always used a binary >> search, would that lose enough performance with small numbers of >> segments that anyone would care? If so, maybe we need to use linear >> search for small numbers of segments and switch to binary search with >> larger numbers of segments. > > I just went and tested. Thanks! > That's after inlining the compare on both the linear and sequential > code, and it seems it lets the compiler optimize the binary search to > the point where it outperforms the sequential search. > > That's not the case when the compare isn't inlined. > > That seems in line with [1], that show the impact of various > optimizations on both algorithms. It's clearly a close enough race > that optimizations play a huge role. > > Since we're not likely to go and implement SSE2-optimized versions, I > believe I'll leave the binary search only. That's the attached patch > set.

That sounds reasonable based on your test results. I guess part of what I was wondering is whether a vacuum on a table large enough to require multiple gigabytes of work_mem isn't likely to be I/O-bound anyway. If so, a few cycles one way or the other isn't likely to matter much. If not, where exactly are all of those CPU cycles going?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Apr 23, 2017 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> That's after inlining the compare on both the linear and sequential >> code, and it seems it lets the compiler optimize the binary search to >> the point where it outperforms the sequential search. >> >> That's not the case when the compare isn't inlined. >> >> That seems in line with [1], that show the impact of various >> optimizations on both algorithms. It's clearly a close enough race >> that optimizations play a huge role. >> >> Since we're not likely to go and implement SSE2-optimized versions, I >> believe I'll leave the binary search only. That's the attached patch >> set. > > That sounds reasonable based on your test results. I guess part of > what I was wondering is whether a vacuum on a table large enough to > require multiple gigabytes of work_mem isn't likely to be I/O-bound > anyway. If so, a few cycles one way or the other isn't likely > to matter much. If not, where exactly are all of those CPU cycles > going?

I haven't been able to produce a table large enough to get a CPU-bound vacuum, so such a case is likely to require huge storage and a very powerful I/O system. Mine can only get about 100MB/s tops, and at that speed, vacuum is I/O bound even for multi-GB work_mem. That's why I've been using the reported CPU time as the benchmark.

BTW, I left the benchmark script running all weekend at the office, and when I got back a power outage had aborted it. In a few days I'll be out on vacation, so I'm not sure I'll get the benchmark results anytime soon. But since this patch moved to 11.0, I guess there's no rush.

Just FTR, in case I leave before the script is done, the script got to scale 400 before the outage:

INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 40000000 row versions
DETAIL: CPU: user: 5.94 s, system: 1.26 s, elapsed: 26.77 s.
INFO: "pgbench_accounts": removed 40000000 row versions in 655739 pages
DETAIL: CPU: user: 3.36 s, system: 2.57 s, elapsed: 61.67 s.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 109679 pages
DETAIL: 40000000 index row versions were removed. 109289 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.06 s.
INFO: "pgbench_accounts": found 38925546 removable, 0 nonremovable row versions in 655738 out of 655738 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1098 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.
INFO: "pgbench_accounts": truncated 655738 to 0 pages
DETAIL: CPU: user: 0.22 s, system: 2.10 s, elapsed: 8.10 s.

In summary:

binsrch v10:
s100: CPU: user: 3.02 s, system: 1.51 s, elapsed: 16.43 s.
s400: CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.

The old results:

Old Patched (sequential search):
s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.

Unpatched:
s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

I wouldn't fret over the slight slowdown vs the old patch, it could be noise (the script only completed a single run at scale 400).
On Mon, Apr 24, 2017 at 3:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > I wouldn't fret over the slight slowdown vs the old patch, it could be > noise (the script only completed a single run at scale 400). Yeah, seems fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 21, 2017 at 6:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> In essence, the patch as it is proposed, doesn't *need* a binary >>> search, because the segment list can only grow up to 15 segments at >>> its biggest, and that's a size small enough that linear search will >>> outperform (or at least perform as well as) binary search. Reducing >>> the initial segment size wouldn't change that. If the 12GB limit is >>> lifted, or the maximum segment size reduced (from 1GB to 128MB for >>> example), however, that would change. >>> >>> I'd be more in favor of lifting the 12GB limit than of reducing the >>> maximum segment size, for the reasons above. Raising the 12GB limit >>> has concrete and readily apparent benefits, whereas using bigger (or >>> smaller) segments is far more debatable. Yes, that will need a binary >>> search. But, I was hoping that could be a second (or third) patch, to >>> keep things simple, and benefits measurable. >> >> To me, it seems a bit short-sighted to say, OK, let's use a linear >> search because there's this 12GB limit so we can limit ourselves to 15 >> segments. Because somebody will want to remove that 12GB limit, and >> then we'll have to revisit the whole thing anyway. I think, anyway. > > Ok, attached an updated patch that implements the binary search > >> What's not clear to me is how sensitive the performance of vacuum is >> to the number of cycles used here. For a large index, the number of >> searches will presumably be quite large, so it does seem worth >> worrying about performance. But if we just always used a binary >> search, would that lose enough performance with small numbers of >> segments that anyone would care? If so, maybe we need to use linear >> search for small numbers of segments and switch to binary search with >> larger numbers of segments. > > I just went and tested. > > I implemented the hybrid binary search attached, and ran a few tests > with and without the sequential code enabled, at small scales. > > The difference is statistically significant, but small (less than 3%). > With proper optimization of the binary search, however, the difference > flips: > > claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 > fullbinary.s100.times > vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s, > elapsed: 18.34 s. > vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s, > elapsed: 19.75 s. > vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s, > elapsed: 18.48 s. > vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s, > elapsed: 20.60 s. > vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s, > elapsed: 19.16 s. > > claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80 > hybridbinary.s100.times > vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s, > elapsed: 19.15 s. > vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s, > elapsed: 18.40 s. > vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s, > elapsed: 18.87 s. > vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s, > elapsed: 26.43 s. > vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s, > elapsed: 20.02 s. 
> > That's after inlining the compare on both the linear and sequential > code, and it seems it lets the compiler optimize the binary search to > the point where it outperforms the sequential search. > > That's not the case when the compare isn't inlined. > > That seems in line with [1], that show the impact of various > optimizations on both algorithms. It's clearly a close enough race > that optimizations play a huge role. > > Since we're not likely to go and implement SSE2-optimized versions, I > believe I'll leave the binary search only. That's the attached patch > set. > > I'm running the full test suite, but that takes a very long while. > I'll post the results when they're done. > > [1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/ Thank you for updating the patch. I've read this patch again and here are some review comments. + * Lookup in that structure proceeds sequentially in the list of segments, + * and with a binary search within each segment. Since segment's size grows + * exponentially, this retains O(log N) lookup complexity (2 log N to be + * precise). IIUC we now do binary search even over the list of segments. ----- We often fetch a particular dead tuple segment. How about providing a macro for easier understanding? For example, #define GetDeadTuplsSegment(lvrelstats, seg) \ (&(lvrelstats)->dead_tuples.dt_segments[(seg)]) ----- + if (vacrelstats->dead_tuples.num_segs == 0) + return; + + /* If uninitialized, we have no tuples to delete from the indexes */ + if (vacrelstats->dead_tuples.num_segs == 0) + { + return; + } + if (vacrelstats->dead_tuples.num_segs == 0) + return false; + As I listed, there is code to check if dead tuple is initialized already in some places where doing actual vacuum. I guess that it should not happen that we attempt to vacuum a table/index page while not having any dead tuple. Is it better to have Assert or ereport instead? ----- @@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) - BlockNumber prefetchStart; - BlockNumber pblkno; + BlockNumber prefetchStart; + BlockNumber pblkno; I think that it's a unnecessary change. ----- + /* Search for the segment likely to contain the item pointer */ + iseg = vac_itemptr_binsrch( + (void *) itemptr, + (void *) &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple), + vacrelstats->dead_tuples.last_seg + 1, + sizeof(DeadTuplesSegment)); + I think that we can change the above to; + /* Search for the segment likely to contain the item pointer */ + iseg = vac_itemptr_binsrch( + (void *) itemptr, + (void *) &(seg->last_dead_tuple), + vacrelstats->dead_tuples.last_seg + 1, + sizeof(DeadTuplesSegment)); We set "seg = vacrelstats->dead_tuples.dt_segments" just before this. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
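For context on the lookup being reviewed here: it is two-level, first locating the segment whose last_dead_tuple bounds the searched TID, then binary-searching within that segment. A simplified standalone sketch of that shape (hypothetical names and data, not the patch's actual vac_itemptr_binsrch):

#include <stdio.h>

typedef struct
{
    unsigned int   blkno;
    unsigned short offset;
} TidSketch;

typedef struct
{
    TidSketch  last;     /* copy of the segment's largest TID */
    TidSketch *tids;     /* sorted dead TIDs in this segment */
    int        ntids;
} SegSketch;

static inline int
tid_cmp(const TidSketch *a, const TidSketch *b)
{
    if (a->blkno != b->blkno)
        return a->blkno < b->blkno ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Two-level lookup: binary search over segments by their last TID, then a
 * binary search inside the chosen segment. */
static int
tid_is_dead(const TidSketch *key, const SegSketch *segs, int nsegs)
{
    int lo = 0, hi = nsegs - 1;

    if (nsegs == 0)
        return 0;

    /* Find the first segment whose last TID is >= key. */
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (tid_cmp(&segs[mid].last, key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (tid_cmp(&segs[lo].last, key) < 0)
        return 0;                        /* key is beyond every segment */

    /* Search inside that segment. */
    {
        const SegSketch *seg = &segs[lo];
        int l = 0, h = seg->ntids - 1;

        while (l <= h)
        {
            int m = l + (h - l) / 2;
            int c = tid_cmp(&seg->tids[m], key);

            if (c == 0)
                return 1;
            if (c < 0)
                l = m + 1;
            else
                h = m - 1;
        }
        return 0;
    }
}

int main(void)
{
    TidSketch seg0[] = {{1, 1}, {1, 5}, {2, 3}};
    TidSketch seg1[] = {{7, 2}, {9, 1}, {9, 8}};
    SegSketch segs[] = {{{2, 3}, seg0, 3}, {{9, 8}, seg1, 3}};
    TidSketch hit = {9, 1}, miss = {5, 4};

    printf("{9,1}: %d  {5,4}: %d\n",
           tid_is_dead(&hit, segs, 2), tid_is_dead(&miss, segs, 2));
    return 0;
}

The patch's actual code differs in detail, but the two-level shape is the part under discussion.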
Sorry for the delay, I had extended vacations that kept me away from my test rigs, and afterward testing too, liteally, a few weeks. I built a more thoroguh test script that produced some interesting results. Will attach the results. For now, to the review comments: On Thu, Apr 27, 2017 at 4:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've read this patch again and here are some review comments. > > + * Lookup in that structure proceeds sequentially in the list of segments, > + * and with a binary search within each segment. Since segment's size grows > + * exponentially, this retains O(log N) lookup complexity (2 log N to be > + * precise). > > IIUC we now do binary search even over the list of segments. Right > > ----- > > We often fetch a particular dead tuple segment. How about providing a > macro for easier understanding? > For example, > > #define GetDeadTuplsSegment(lvrelstats, seg) \ > (&(lvrelstats)->dead_tuples.dt_segments[(seg)]) > > ----- > > + if (vacrelstats->dead_tuples.num_segs == 0) > + return; > + > > + /* If uninitialized, we have no tuples to delete from the indexes */ > + if (vacrelstats->dead_tuples.num_segs == 0) > + { > + return; > + } > > + if (vacrelstats->dead_tuples.num_segs == 0) > + return false; > + Ok > As I listed, there is code to check if dead tuple is initialized > already in some places where doing actual vacuum. > I guess that it should not happen that we attempt to vacuum a > table/index page while not having any dead tuple. Is it better to have > Assert or ereport instead? I'm not sure. Having a non-empty dead tuples array is not necessary to be able to honor the contract in the docstring. Most of those functions clean up the heap/index of dead tuples given the array of dead tuples, which is a no-op for an empty array. The code that calls those functions doesn't bother calling if the array is known empty, true, but there's no compelling reason to enforce that at the interface. Doing so could cause subtle bugs rather than catch them (in the form of unexpected assertion failures, if some caller forgot to check the dead tuples array for emptiness). If you're worried about the possibility that some bugs fails to record dead tuples in the array, and thus makes VACUUM silently ineffective, I instead added a test for that case. This should be a better approach, since it's more likely to catch unexpected failure modes than an assert. > @@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel, > LVRelStats *vacrelstats) > - BlockNumber prefetchStart; > - BlockNumber pblkno; > + BlockNumber prefetchStart; > + BlockNumber pblkno; > > I think that it's a unnecessary change. Yep. But funnily that's how it's now in master. > > ----- > > + /* Search for the segment likely to contain the item pointer */ > + iseg = vac_itemptr_binsrch( > + (void *) itemptr, > + (void *) > &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple), > + vacrelstats->dead_tuples.last_seg + 1, > + sizeof(DeadTuplesSegment)); > + > > I think that we can change the above to; > > + /* Search for the segment likely to contain the item pointer */ > + iseg = vac_itemptr_binsrch( > + (void *) itemptr, > + (void *) &(seg->last_dead_tuple), > + vacrelstats->dead_tuples.last_seg + 1, > + sizeof(DeadTuplesSegment)); > > We set "seg = vacrelstats->dead_tuples.dt_segments" just before this. Right Attached is a current version of both patches, rebased since we're at it. 
I'm also attaching the output from the latest benchmark runs, in raw (tar.bz2) and digested (bench_report) forms, the script used to run them (vacuumbench.sh) and to produce the reports (vacuum_bench_report.sh).

Those are before the changes in the review. While I don't expect any change, I'll re-run some of them just in case, and try to investigate the slowdown. But that will take forever. Each run takes about a week on my test rig, and I don't have enough hardware to parallelize the tests. I will run a test on a snapshot of a particularly troublesome production database we have, that should be interesting.

The benchmarks show a consistent improvement at scale 400, which may be related to the search implementation being better somehow, and a slowdown at scale 4000 in some variants. I believe this is due to those variants having highly clustered indexes. While the "shuf" (shuffled) variants were intended to be the opposite of that, I suspect I somehow failed to get the desired outcome, so I'll be double-checking that.

In any case the slowdown only materializes when vacuuming with a large mwm setting, which is something that shouldn't happen unintentionally.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Thank you for the patch and benchmark results, I have a couple remarks. Firstly, padding in DeadTuplesSegment typedef struct DeadTuplesSegment { ItemPointerData last_dead_tuple; /* Copy of the last dead tuple (unset * until the segment is fully * populated). Keep it first to simplify * binary searches */ unsigned short padding; /* Align dt_tids to 32-bits, * sizeof(ItemPointerData) is aligned to * short, so add a padding short, to make the * size of DeadTuplesSegment a multiple of * 32-bits and align integer components for * better performance during lookups into the * multiarray */ int num_dead_tuples; /* # of entries in the segment */ int max_dead_tuples; /* # of entries allocated in the segment */ ItemPointer dt_tids; /* Array of dead tuples */ } DeadTuplesSegment; In the comments to ItemPointerData is written that it is 6 bytes long, but can be padded to 8 bytes by some compilers, so if we add padding in a current way, there is no guaranty that it will be done as it is expected. The other way to do it with pg_attribute_alligned. But in my opinion, there is no need to do it manually, because the compiler will do this optimization itself. On 11.07.2017 19:51, Claudio Freire wrote: >> ----- >> >> + /* Search for the segment likely to contain the item pointer */ >> + iseg = vac_itemptr_binsrch( >> + (void *) itemptr, >> + (void *) >> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple), >> + vacrelstats->dead_tuples.last_seg + 1, >> + sizeof(DeadTuplesSegment)); >> + >> >> I think that we can change the above to; >> >> + /* Search for the segment likely to contain the item pointer */ >> + iseg = vac_itemptr_binsrch( >> + (void *) itemptr, >> + (void *) &(seg->last_dead_tuple), >> + vacrelstats->dead_tuples.last_seg + 1, >> + sizeof(DeadTuplesSegment)); >> >> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this. > Right In my mind, if you change vacrelstats->dead_tuples.last_seg + 1 with GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful. Besides, you can change the vac_itemptr_binsrch within the segment with stdlib bsearch, like: res = (ItemPointer) bsearch((void *) itemptr, (void *) seg->dt_tids, seg->num_dead_tuples, sizeof(ItemPointerData), vac_cmp_itemptr); return (res != NULL); > Those are before the changes in the review. While I don't expect any > change, I'll re-run some of them just in case, and try to investigate > the slowdown. But that will take forever. Each run takes about a week > on my test rig, and I don't have enough hardware to parallelize the > tests. I will run a test on a snapshot of a particularly troublesome > production database we have, that should be interesting. Very interesting, waiting for the results.
On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov <a.chernyshov@postgrespro.ru> wrote: > Thank you for the patch and benchmark results, I have a couple remarks. > Firstly, padding in DeadTuplesSegment > > typedef struct DeadTuplesSegment > > { > > ItemPointerData last_dead_tuple; /* Copy of the last dead tuple > (unset > > * until the segment is fully > > * populated). Keep it first to > simplify > > * binary searches */ > > unsigned short padding; /* Align dt_tids to 32-bits, > > * sizeof(ItemPointerData) is aligned to > > * short, so add a padding short, to make > the > > * size of DeadTuplesSegment a multiple of > > * 32-bits and align integer components for > > * better performance during lookups into > the > > * multiarray */ > > int num_dead_tuples; /* # of entries in the segment */ > > int max_dead_tuples; /* # of entries allocated in the > segment */ > > ItemPointer dt_tids; /* Array of dead tuples */ > > } DeadTuplesSegment; > > In the comments to ItemPointerData is written that it is 6 bytes long, but > can be padded to 8 bytes by some compilers, so if we add padding in a > current way, there is no guaranty that it will be done as it is expected. > The other way to do it with pg_attribute_alligned. But in my opinion, there > is no need to do it manually, because the compiler will do this optimization > itself. I'll look into it. But my experience is that compilers won't align struct size like this, only attributes, and this attribute is composed of 16-bit attributes so it doesn't get aligned by default. > On 11.07.2017 19:51, Claudio Freire wrote: >>> >>> ----- >>> >>> + /* Search for the segment likely to contain the item pointer */ >>> + iseg = vac_itemptr_binsrch( >>> + (void *) itemptr, >>> + (void *) >>> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple), >>> + vacrelstats->dead_tuples.last_seg + 1, >>> + sizeof(DeadTuplesSegment)); >>> + >>> >>> I think that we can change the above to; >>> >>> + /* Search for the segment likely to contain the item pointer */ >>> + iseg = vac_itemptr_binsrch( >>> + (void *) itemptr, >>> + (void *) &(seg->last_dead_tuple), >>> + vacrelstats->dead_tuples.last_seg + 1, >>> + sizeof(DeadTuplesSegment)); >>> >>> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this. >> >> Right > > In my mind, if you change vacrelstats->dead_tuples.last_seg + 1 with > GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful. It's not the same thing. The first run it might, but after a reset of the multiarray, num segments is the allocated size, while last_seg is the last one filled with data. > Besides, you can change the vac_itemptr_binsrch within the segment with > stdlib bsearch, like: > > res = (ItemPointer) bsearch((void *) itemptr, > > (void *) seg->dt_tids, > > seg->num_dead_tuples, > > sizeof(ItemPointerData), > > vac_cmp_itemptr); > > return (res != NULL); The stdlib's bsearch is quite slower. The custom bsearch inlines the comparison making it able to factor out of the loop quite a bit of logic, and in general generate far more specialized assembly. For the compiler to optimize the stdlib's bsearch call, whole-program optimization should be enabled, something that is unlikely. Even then, it may not be able to, due to aliasing rules. This is what I came up to make the new approach's performance on par or better than the old one, in CPU cycles. In fact, benchmarks show that time spent on the CPU is lower now, in large part, due to this. It's not like it's the first custom binary search in postgres, also.
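The point about inlining can be illustrated with a standalone sketch (not the patch code): stdlib bsearch() goes through a comparator function pointer on every probe, while a hand-rolled loop with the comparison written inline lets the compiler specialize the whole search:

#include <stdlib.h>
#include <stdio.h>

typedef struct
{
    unsigned int   blkno;
    unsigned short offset;
} TidSketch;

/* Comparator handed to stdlib bsearch(): called through a function pointer,
 * so the compiler generally cannot inline it into the search loop. */
static int
tid_cmp_cb(const void *pa, const void *pb)
{
    const TidSketch *a = pa, *b = pb;

    if (a->blkno != b->blkno)
        return a->blkno < b->blkno ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Hand-rolled search with the comparison written inline: the compiler can
 * fold the comparison into the loop and specialize the whole search. */
static int
tid_bsearch_inline(const TidSketch *key, const TidSketch *arr, int n)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        const TidSketch *m = &arr[mid];

        if (m->blkno < key->blkno ||
            (m->blkno == key->blkno && m->offset < key->offset))
            lo = mid + 1;
        else if (m->blkno == key->blkno && m->offset == key->offset)
            return 1;
        else
            hi = mid - 1;
    }
    return 0;
}

int main(void)
{
    TidSketch tids[] = {{1, 1}, {1, 3}, {2, 7}, {5, 2}, {9, 9}};
    TidSketch key = {5, 2};

    int via_bsearch = bsearch(&key, tids, 5, sizeof(TidSketch), tid_cmp_cb) != NULL;
    int via_inline  = tid_bsearch_inline(&key, tids, 5);

    printf("bsearch: %d, inline: %d\n", via_bsearch, via_inline);
    return 0;
}

Whether the compiler can actually see through the function pointer depends on optimization settings and whole-program visibility, which is the point being made above.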
On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov > <a.chernyshov@postgrespro.ru> wrote: >> Thank you for the patch and benchmark results, I have a couple remarks. >> Firstly, padding in DeadTuplesSegment >> >> typedef struct DeadTuplesSegment >> >> { >> >> ItemPointerData last_dead_tuple; /* Copy of the last dead tuple >> (unset >> >> * until the segment is fully >> >> * populated). Keep it first to >> simplify >> >> * binary searches */ >> >> unsigned short padding; /* Align dt_tids to 32-bits, >> >> * sizeof(ItemPointerData) is aligned to >> >> * short, so add a padding short, to make >> the >> >> * size of DeadTuplesSegment a multiple of >> >> * 32-bits and align integer components for >> >> * better performance during lookups into >> the >> >> * multiarray */ >> >> int num_dead_tuples; /* # of entries in the segment */ >> >> int max_dead_tuples; /* # of entries allocated in the >> segment */ >> >> ItemPointer dt_tids; /* Array of dead tuples */ >> >> } DeadTuplesSegment; >> >> In the comments to ItemPointerData is written that it is 6 bytes long, but >> can be padded to 8 bytes by some compilers, so if we add padding in a >> current way, there is no guaranty that it will be done as it is expected. >> The other way to do it with pg_attribute_alligned. But in my opinion, there >> is no need to do it manually, because the compiler will do this optimization >> itself. > > I'll look into it. But my experience is that compilers won't align > struct size like this, only attributes, and this attribute is composed > of 16-bit attributes so it doesn't get aligned by default. Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at least in GCC. I'll remove the padding. Seems I just got the wrong impression at some point.
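A quick standalone check of that (using a 6-byte stand-in for ItemPointerData rather than the real PostgreSQL headers) shows the compiler already inserts the alignment padding on a typical 64-bit ABI, which is what makes the explicit padding member redundant:

#include <stdio.h>
#include <stddef.h>

/* 6-byte stand-in for ItemPointerData (three 16-bit fields). */
typedef struct
{
    unsigned short bi_hi;
    unsigned short bi_lo;
    unsigned short offset;
} TidStandIn;

/* Segment layout with no manual padding member. */
typedef struct
{
    TidStandIn  last_dead_tuple;
    int         num_dead_tuples;
    int         max_dead_tuples;
    TidStandIn *dt_tids;
} SegNoPad;

int main(void)
{
    /* On a typical 64-bit ABI this prints 6, 8 and 24: the compiler pads
     * after the 6-byte member so the int that follows is 4-byte aligned. */
    printf("sizeof(TidStandIn) = %zu\n", sizeof(TidStandIn));
    printf("offsetof(SegNoPad, num_dead_tuples) = %zu\n",
           offsetof(SegNoPad, num_dead_tuples));
    printf("sizeof(SegNoPad) = %zu\n", sizeof(SegNoPad));
    return 0;
}

Compilers are permitted to lay this out differently, so the printed values are only what one would expect on common platforms.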
On Wed, Jul 12, 2017 at 1:29 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov >> <a.chernyshov@postgrespro.ru> wrote: >>> Thank you for the patch and benchmark results, I have a couple remarks. >>> Firstly, padding in DeadTuplesSegment >>> >>> typedef struct DeadTuplesSegment >>> >>> { >>> >>> ItemPointerData last_dead_tuple; /* Copy of the last dead tuple >>> (unset >>> >>> * until the segment is fully >>> >>> * populated). Keep it first to >>> simplify >>> >>> * binary searches */ >>> >>> unsigned short padding; /* Align dt_tids to 32-bits, >>> >>> * sizeof(ItemPointerData) is aligned to >>> >>> * short, so add a padding short, to make >>> the >>> >>> * size of DeadTuplesSegment a multiple of >>> >>> * 32-bits and align integer components for >>> >>> * better performance during lookups into >>> the >>> >>> * multiarray */ >>> >>> int num_dead_tuples; /* # of entries in the segment */ >>> >>> int max_dead_tuples; /* # of entries allocated in the >>> segment */ >>> >>> ItemPointer dt_tids; /* Array of dead tuples */ >>> >>> } DeadTuplesSegment; >>> >>> In the comments to ItemPointerData is written that it is 6 bytes long, but >>> can be padded to 8 bytes by some compilers, so if we add padding in a >>> current way, there is no guaranty that it will be done as it is expected. >>> The other way to do it with pg_attribute_alligned. But in my opinion, there >>> is no need to do it manually, because the compiler will do this optimization >>> itself. >> >> I'll look into it. But my experience is that compilers won't align >> struct size like this, only attributes, and this attribute is composed >> of 16-bit attributes so it doesn't get aligned by default. > > Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at > least in GCC. I'll remove the padding. > > Seems I just got the wrong impression at some point. Updated versions of the patches attached. A few runs of the benchmark show no significant difference, as it should (being all cosmetic changes). The bigger benchmark will take longer. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Indeed they do, and that's what motivated this patch. But I'd need > TB-sized tables to set up something like that. I don't have the > hardware or time available to do that (vacuum on bloated TB-sized > tables can take days in my experience). Scale 4000 is as big as I can > get without running out of space for the tests in my test hardware. > > If anybody else has the ability, I'd be thankful if they did test it > under those conditions, but I cannot. I think Anastasia's test is > closer to such a test, that's probably why it shows a bigger > improvement in total elapsed time. > > Our production database could possibly be used, but it can take about > a week to clone it, upgrade it (it's 9.5 currently), and run the > relevant vacuum. It looks like I won't be able to do that test with a production snapshot anytime soon. Getting approval for the budget required to do that looks like it's going to take far longer than I thought. Regardless of that, I think the patch can move forward. I'm still planning to do the test at some point, but this patch shouldn't block on it.
> On 18 Aug 2017, at 13:39, Claudio Freire <klaussfreire@gmail.com> wrote: > > On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Indeed they do, and that's what motivated this patch. But I'd need >> TB-sized tables to set up something like that. I don't have the >> hardware or time available to do that (vacuum on bloated TB-sized >> tables can take days in my experience). Scale 4000 is as big as I can >> get without running out of space for the tests in my test hardware. >> >> If anybody else has the ability, I'd be thankful if they did test it >> under those conditions, but I cannot. I think Anastasia's test is >> closer to such a test, that's probably why it shows a bigger >> improvement in total elapsed time. >> >> Our production database could possibly be used, but it can take about >> a week to clone it, upgrade it (it's 9.5 currently), and run the >> relevant vacuum. > > It looks like I won't be able to do that test with a production > snapshot anytime soon. > > Getting approval for the budget required to do that looks like it's > going to take far longer than I thought. > > Regardless of that, I think the patch can move forward. I'm still > planning to do the test at some point, but this patch shouldn't block > on it. This patch has been marked Ready for committer after review, but wasn’t committed in the current commitfest so it will be moved to the next. Since it no longer applies cleanly, it’s being reset to Waiting for author though. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Oct 1, 2017 at 8:36 PM, Daniel Gustafsson <daniel@yesql.se> wrote: >> On 18 Aug 2017, at 13:39, Claudio Freire <klaussfreire@gmail.com> wrote: >> >> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> Indeed they do, and that's what motivated this patch. But I'd need >>> TB-sized tables to set up something like that. I don't have the >>> hardware or time available to do that (vacuum on bloated TB-sized >>> tables can take days in my experience). Scale 4000 is as big as I can >>> get without running out of space for the tests in my test hardware. >>> >>> If anybody else has the ability, I'd be thankful if they did test it >>> under those conditions, but I cannot. I think Anastasia's test is >>> closer to such a test, that's probably why it shows a bigger >>> improvement in total elapsed time. >>> >>> Our production database could possibly be used, but it can take about >>> a week to clone it, upgrade it (it's 9.5 currently), and run the >>> relevant vacuum. >> >> It looks like I won't be able to do that test with a production >> snapshot anytime soon. >> >> Getting approval for the budget required to do that looks like it's >> going to take far longer than I thought. >> >> Regardless of that, I think the patch can move forward. I'm still >> planning to do the test at some point, but this patch shouldn't block >> on it. > > This patch has been marked Ready for committer after review, but wasn’t > committed in the current commitfest so it will be moved to the next. Since it > no longer applies cleanly, it’s being reset to Waiting for author though. > > cheers ./daniel Rebased version of the patches attached -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Oct 2, 2017 at 11:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Rebased version of the patches attached The status of the patch is misleading: https://commitfest.postgresql.org/15/844/. This was marked as waiting on author but a new version has been published. Let's be careful. The last patches I am aware of, aka those from https://www.postgresql.org/message-id/CAGTBQpZHTf2JtShC=ijc9wzEipo3XOKWQhx+8WiP7ZjPC3FBEg@mail.gmail.com, do not apply. I am moving the patch to the next commit fest with a waiting on author status, as this should be reviewed, but those need a rebase. -- Michael
On Tue, Nov 28, 2017 at 10:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Oct 2, 2017 at 11:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Rebased version of the patches attached > > The status of the patch is misleading: > https://commitfest.postgresql.org/15/844/. This was marked as waiting > on author but a new version has been published. Let's be careful. > > The last patches I am aware of, aka those from > https://www.postgresql.org/message-id/CAGTBQpZHTf2JtShC=ijc9wzEipo3XOKWQhx+8WiP7ZjPC3FBEg@mail.gmail.com, > do not apply. I am moving the patch to the next commit fest with a > waiting on author status, as this should be reviewed, but those need a > rebase. They did apply at the time, but I think major work on vacuum was pushed since then, and also I was traveling so out of reach. It may take some time to rebase them again. Should I move to needs review myself after that?
On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > They did apply at the time, but I think major work on vacuum was > pushed since then, and also I was traveling so out of reach. > > It may take some time to rebase them again. Should I move to needs > review myself after that? Sure, if you can get into this state, please feel free to update the status of the patch yourself. -- Michael
Greetings, * Michael Paquier (michael.paquier@gmail.com) wrote: > On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > > They did apply at the time, but I think major work on vacuum was > > pushed since then, and also I was traveling so out of reach. > > > > It may take some time to rebase them again. Should I move to needs > > review myself after that? > > Sure, if you can get into this state, please feel free to update the > status of the patch yourself. We're now over a month since this status update- Claudio, for this to have a chance during this commitfest to be included (which, personally, I think would be great as it solves a pretty serious issue..), we really need to have it be rebased and updated. Once that's done, as Michael says, please change the patch status back to 'Needs Review'. Thanks! Stephen
Attachment
On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
> * Michael Paquier (michael.paquier@gmail.com) wrote:
> > On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > > They did apply at the time, but I think major work on vacuum was
> > > pushed since then, and also I was traveling so out of reach.
> > >
> > > It may take some time to rebase them again. Should I move to needs
> > > review myself after that?
> >
> > Sure, if you can get into this state, please feel free to update the
> > status of the patch yourself.
> We're now over a month since this status update- Claudio, for this to
> have a chance during this commitfest to be included (which, personally,
> I think would be great as it solves a pretty serious issue..), we really
> need to have it be rebased and updated. Once that's done, as Michael
> says, please change the patch status back to 'Needs Review'.
Sorry, had tons of other stuff that took priority.
I'll get to rebase this patch now.
On Wed, Jan 17, 2018 at 5:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Greetings,
> > * Michael Paquier (michael.paquier@gmail.com) wrote:
> > > On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > > > They did apply at the time, but I think major work on vacuum was
> > > > pushed since then, and also I was traveling so out of reach.
> > > >
> > > > It may take some time to rebase them again. Should I move to needs
> > > > review myself after that?
> > >
> > > Sure, if you can get into this state, please feel free to update the
> > > status of the patch yourself.
> > We're now over a month since this status update- Claudio, for this to
> > have a chance during this commitfest to be included (which, personally,
> > I think would be great as it solves a pretty serious issue..), we really
> > need to have it be rebased and updated. Once that's done, as Michael
> > says, please change the patch status back to 'Needs Review'.
>
> Sorry, had tons of other stuff that took priority.
>
> I'll get to rebase this patch now.
Huh. That was simpler than I thought.
Attached rebased versions.
Attachment
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I can confirm that these patches don't break anything; the code is well documented, has some tests and doesn't do anything obviously wrong. However I would recommend that someone who is more familiar with the VACUUM mechanism than I am recheck these patches.

The new status of this patch is: Ready for Committer
On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > Huh. That was simpler than I thought. > > Attached rebased versions. Hi Claudio, FYI the regression test seems to have some run-to-run variation. Though it usually succeeds, recently I have seen a couple of failures like this: ========= Contents of ./src/test/regress/regression.diffs *** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out 2018-01-24 01:41:28.200454371 +0000 --- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out 2018-01-24 01:51:07.970049937 +0000 *************** *** 128,134 **** SELECT pg_relation_size('vactst', 'main'); pg_relation_size ------------------ ! 0 (1 row) SELECT count(*) FROM vactst; --- 128,134 ---- SELECT pg_relation_size('vactst', 'main'); pg_relation_size ------------------ ! 8192 (1 row) SELECT count(*) FROM vactst; ====================================================================== -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Huh. That was simpler than I thought. >> >> Attached rebased versions. > > Hi Claudio, > > FYI the regression test seems to have some run-to-run variation. > Though it usually succeeds, recently I have seen a couple of failures > like this: > > ========= Contents of ./src/test/regress/regression.diffs > *** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out > 2018-01-24 01:41:28.200454371 +0000 > --- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out > 2018-01-24 01:51:07.970049937 +0000 > *************** > *** 128,134 **** > SELECT pg_relation_size('vactst', 'main'); > pg_relation_size > ------------------ > ! 0 > (1 row) > > SELECT count(*) FROM vactst; > --- 128,134 ---- > SELECT pg_relation_size('vactst', 'main'); > pg_relation_size > ------------------ > ! 8192 > (1 row) > > SELECT count(*) FROM vactst; > ====================================================================== > > -- > Thomas Munro > http://www.enterprisedb.com I'll look into it However, shouldn't an empty relation have an initial page anyway? In that case shouldn't the correct value be 8192?
Claudio Freire wrote: > On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > *** 128,134 **** > > SELECT pg_relation_size('vactst', 'main'); > > pg_relation_size > > ------------------ > > ! 0 > > (1 row) > > > > SELECT count(*) FROM vactst; > > --- 128,134 ---- > > SELECT pg_relation_size('vactst', 'main'); > > pg_relation_size > > ------------------ > > ! 8192 > > (1 row) > However, shouldn't an empty relation have an initial page anyway? In > that case shouldn't the correct value be 8192? No, it's legal for an empty table to have size 0 on disk. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 25, 2018 at 10:56 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> Huh. That was simpler than I thought. >>> >>> Attached rebased versions. >> >> Hi Claudio, >> >> FYI the regression test seems to have some run-to-run variation. >> Though it usually succeeds, recently I have seen a couple of failures >> like this: >> >> ========= Contents of ./src/test/regress/regression.diffs >> *** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out >> 2018-01-24 01:41:28.200454371 +0000 >> --- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out >> 2018-01-24 01:51:07.970049937 +0000 >> *************** >> *** 128,134 **** >> SELECT pg_relation_size('vactst', 'main'); >> pg_relation_size >> ------------------ >> ! 0 >> (1 row) >> >> SELECT count(*) FROM vactst; >> --- 128,134 ---- >> SELECT pg_relation_size('vactst', 'main'); >> pg_relation_size >> ------------------ >> ! 8192 >> (1 row) >> >> SELECT count(*) FROM vactst; >> ====================================================================== >> >> -- >> Thomas Munro >> http://www.enterprisedb.com > > I'll look into it I had the tests running in a loop all day long, and I cannot reproduce that variance. Can you share your steps to reproduce it, including configure flags?
On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > I had the tests running in a loop all day long, and I cannot reproduce > that variance. > > Can you share your steps to reproduce it, including configure flags? Here are two build logs where it failed: https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819 https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511 Here's one where it succeeded: https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855 The full build script used is: ./configure --enable-debug --enable-cassert --enable-coverage --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap --with-icu && make -j4 all contrib docs && make -Otarget -j3 check-world This is a virtualised 4 core system. I wonder if "make -Otarget -j3 check-world" creates enough load on it to produce some weird timing effect that you don't see on your development system. -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I had the tests running in a loop all day long, and I cannot reproduce >> that variance. >> >> Can you share your steps to reproduce it, including configure flags? > > Here are two build logs where it failed: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819 > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511 > > Here's one where it succeeded: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855 > > The full build script used is: > > ./configure --enable-debug --enable-cassert --enable-coverage > --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap > --with-icu && make -j4 all contrib docs && make -Otarget -j3 > check-world > > This is a virtualised 4 core system. I wonder if "make -Otarget -j3 > check-world" creates enough load on it to produce some weird timing > effect that you don't see on your development system. I can't reproduce it, not even with the same build script. It's starting to look like a timing effect indeed. I get a similar effect if there's an active snapshot in another session while vacuum runs. I don't know how the test suite ends up in that situation, but it seems to be the case. How do you suggest we go about fixing this? The test in question is important, I've caught actual bugs in the implementation with it, because it checks that vacuum effectively frees up space. I'm thinking this vacuum test could be put on its own parallel group perhaps? Since I can't reproduce it, I can't know whether that will fix it, but it seems sensible.
Hello, At Fri, 2 Feb 2018 19:52:02 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in <CAGTBQpaiNQSNJC8y4w82UBTaPsvSqRRg++yEi5wre1MFE2iD8Q@mail.gmail.com> > On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > >> I had the tests running in a loop all day long, and I cannot reproduce > >> that variance. > >> > >> Can you share your steps to reproduce it, including configure flags? > > > > Here are two build logs where it failed: > > > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819 > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511 > > > > Here's one where it succeeded: > > > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855 > > > > The full build script used is: > > > > ./configure --enable-debug --enable-cassert --enable-coverage > > --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap > > --with-icu && make -j4 all contrib docs && make -Otarget -j3 > > check-world > > > > This is a virtualised 4 core system. I wonder if "make -Otarget -j3 > > check-world" creates enough load on it to produce some weird timing > > effect that you don't see on your development system. > > I can't reproduce it, not even with the same build script. I had the same error by "make -j3 check-world" but only twice from many trials. > It's starting to look like a timing effect indeed. It seems to be truncation skip, maybe caused by concurrent autovacuum. See lazy_truncate_heap() for details. Updates of pg_stat_*_tables can be delayed so looking it also can fail. Even though I haven't looked the patch closer, the "SELECT pg_relation_size()" doesn't seem to give something meaningful anyway. > I get a similar effect if there's an active snapshot in another > session while vacuum runs. I don't know how the test suite ends up in > that situation, but it seems to be the case. > > How do you suggest we go about fixing this? The test in question is > important, I've caught actual bugs in the implementation with it, > because it checks that vacuum effectively frees up space. > > I'm thinking this vacuum test could be put on its own parallel group > perhaps? Since I can't reproduce it, I can't know whether that will > fix it, but it seems sensible. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> It's starting to look like a timing effect indeed. > > It seems to be truncation skip, maybe caused by concurrent > autovacuum. Good point, I'll also disable autovacuum on vactst. > See lazy_truncate_heap() for details. Updates of > pg_stat_*_tables can be delayed so looking it also can fail. Even > though I haven't looked the patch closer, the "SELECT > pg_relation_size()" doesn't seem to give something meaningful > anyway. Maybe then "explain (analyze, buffers, costs off, timing off, summary off) select * from vactst" then. The point is to check that the relation's heap has 0 pages.
On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> It's starting to look like a timing effect indeed. >> >> It seems to be truncation skip, maybe caused by concurrent >> autovacuum. > > Good point, I'll also disable autovacuum on vactst. > >> See lazy_truncate_heap() for details. Updates of >> pg_stat_*_tables can be delayed so looking it also can fail. Even >> though I haven't looked the patch closer, the "SELECT >> pg_relation_size()" doesn't seem to give something meaningful >> anyway. > > Maybe then "explain (analyze, buffers, costs off, timing off, summary > off) select * from vactst" then. > > The point is to check that the relation's heap has 0 pages. Attached rebased patches with those changes mentioned above, namely: - vacuum test now creates vactst with autovacuum disabled for it - vacuum test on its own parallel group - use explain analyze instead of pg_relation_size to check the relation is properly truncated
Attachment
Hello, At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in <CAGTBQpaufC0o8OikWd8=5biXcbYjT51mPLfmy22NUjX6kUED0A@mail.gmail.com> > On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > > On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>> It's starting to look like a timing effect indeed. > >> > >> It seems to be truncation skip, maybe caused by concurrent > >> autovacuum. > > > > Good point, I'll also disable autovacuum on vactst. > > > >> See lazy_truncate_heap() for details. Updates of > >> pg_stat_*_tables can be delayed so looking it also can fail. Even > >> though I haven't looked the patch closer, the "SELECT > >> pg_relation_size()" doesn't seem to give something meaningful > >> anyway. > > > > Maybe then "explain (analyze, buffers, costs off, timing off, summary > > off) select * from vactst" then. Ah, sorry. I meant by the above that it gives unstable result with autovacuum. So pg_relation_size() is usable after you turned of autovacuum on the table. > > The point is to check that the relation's heap has 0 pages. > > Attached rebased patches with those changes mentioned above, namely: > > - vacuum test now creates vactst with autovacuum disabled for it > - vacuum test on its own parallel group > - use explain analyze instead of pg_relation_size to check the > relation is properly truncated The problematic test was in the 0001..v14 patch. The new 0001..v15 is identical to v14 and instead 0003-v8 has additional part that edits the test and expects added by 0001 into the shape as above. It seems that you merged the fixup onto the wrong commit. And may we assume it correct that 0002 is missing in this patchset? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 7, 2018 at 12:50 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in <CAGTBQpaufC0o8OikWd8=5biXcbYjT51mPLfmy22NUjX6kUED0A@mail.gmail.com> >> On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> > On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI >> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >>> It's starting to look like a timing effect indeed. >> >> >> >> It seems to be truncation skip, maybe caused by concurrent >> >> autovacuum. >> > >> > Good point, I'll also disable autovacuum on vactst. >> > >> >> See lazy_truncate_heap() for details. Updates of >> >> pg_stat_*_tables can be delayed so looking it also can fail. Even >> >> though I haven't looked the patch closer, the "SELECT >> >> pg_relation_size()" doesn't seem to give something meaningful >> >> anyway. >> > >> > Maybe then "explain (analyze, buffers, costs off, timing off, summary >> > off) select * from vactst" then. > > Ah, sorry. I meant by the above that it gives unstable result > with autovacuum. So pg_relation_size() is usable after you turned > of autovacuum on the table. You did mention stats could be delayed >> > The point is to check that the relation's heap has 0 pages. >> >> Attached rebased patches with those changes mentioned above, namely: >> >> - vacuum test now creates vactst with autovacuum disabled for it >> - vacuum test on its own parallel group >> - use explain analyze instead of pg_relation_size to check the >> relation is properly truncated > > The problematic test was in the 0001..v14 patch. The new > 0001..v15 is identical to v14 and instead 0003-v8 has additional > part that edits the test and expects added by 0001 into the shape > as above. It seems that you merged the fixup onto the wrong > commit. > > And may we assume it correct that 0002 is missing in this > patchset? Sounds like I botched the rebase. Sorry about that. Attached are corrected versions (1-v16 and 3-v9)
Attachment
Claudio Freire wrote: > - vacuum test on its own parallel group Hmm, this solution is not very friendly to the goal of reducing test runtime, particularly since the new test creates a nontrivial-sized table. Maybe we can find a better alternative. Can we use some wait logic instead? Maybe something like grab a snapshot of running VXIDs and loop waiting until they're all gone before doing the vacuum? Also, I don't understand why pg_relation_size() is not a better solution to determining the table size compared to explain. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Claudio Freire wrote: > >> - vacuum test on its own parallel group > > Hmm, this solution is not very friendly to the goal of reducing test > runtime, particularly since the new test creates a nontrivial-sized > table. Maybe we can find a better alternative. Can we use some wait > logic instead? Maybe something like grab a snapshot of running VXIDs > and loop waiting until they're all gone before doing the vacuum? I'm not sure there's any alternative. I did some tests and any active snapshot on any other table, not necessarily on the one being vacuumed, distorted the test. And it makes sense, since that snapshot makes those deleted tuples unvacuumable. Waiting as you say would be akin to what the patch does by putting vacuum on its own parallel group. I'm guessing all tests write something to the database, so all tests will create a snapshot. Maybe if there were read-only tests, those might be safe to include in vacuum's parallel group, but otherwise I don't see any alternative. > Also, I don't understand why pg_relation_size() is not a better solution > to determining the table size compared to explain. I was told pg_relation_size can return stale information. I didn't check, I took that at face value.
Claudio Freire wrote: > On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Claudio Freire wrote: > > Hmm, this solution is not very friendly to the goal of reducing test > > runtime, particularly since the new test creates a nontrivial-sized > > table. Maybe we can find a better alternative. Can we use some wait > > logic instead? Maybe something like grab a snapshot of running VXIDs > > and loop waiting until they're all gone before doing the vacuum? > > I'm not sure there's any alternative. I did some tests and any active > snapshot on any other table, not necessarily on the one being > vacuumed, distorted the test. And it makes sense, since that snapshot > makes those deleted tuples unvacuumable. Sure. > Waiting as you say would be akin to what the patch does by putting > vacuum on its own parallel group. I don't think it's the same. We don't need to wait until all the concurrent tests are done -- we only need to wait until the transactions that were current when the delete finished are done, which is very different since each test runs tons of small transactions rather than one single big transaction. > > Also, I don't understand why pg_relation_size() is not a better solution > > to determining the table size compared to explain. > > I was told pg_relation_size can return stale information. I didn't > check, I took that at face value. Hmm, it uses stat() on the table files. I think those files would be truncated at the time the transaction commits, so they shouldn't be stale. (I don't think the system waits for a checkpoint to flush a truncation.) Maybe relying on that is not reliable or future-proof enough. Anyway this is a minor point -- the one above worries me most. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> Waiting as you say would be akin to what the patch does by putting >> vacuum on its own parallel group. > > I don't think it's the same. We don't need to wait until all the > concurrent tests are done -- we only need to wait until the transactions > that were current when the delete finished are done, which is very > different since each test runs tons of small transactions rather than > one single big transaction. Um... maybe "lock pg_class" ? That should conflict with basically any other running transaction and have pretty much that effect. Attached is a version of patch 1 with that approach.
Attachment
Claudio Freire wrote: > On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> Waiting as you say would be akin to what the patch does by putting > >> vacuum on its own parallel group. > > > > I don't think it's the same. We don't need to wait until all the > > concurrent tests are done -- we only need to wait until the transactions > > that were current when the delete finished are done, which is very > > different since each test runs tons of small transactions rather than > > one single big transaction. > > Um... maybe "lock pg_class" ? I was thinking in first doing SELECT array_agg(DISTINCT virtualtransaction) vxids FROM pg_locks \gset and then in a DO block loop until SELECT DISTINCT virtualtransaction FROM pg_locks INTERSECT SELECT (unnest(:'vxids'::text[])); returns empty; something along those lines. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Claudio Freire wrote: >> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> >> Waiting as you say would be akin to what the patch does by putting >> >> vacuum on its own parallel group. >> > >> > I don't think it's the same. We don't need to wait until all the >> > concurrent tests are done -- we only need to wait until the transactions >> > that were current when the delete finished are done, which is very >> > different since each test runs tons of small transactions rather than >> > one single big transaction. >> >> Um... maybe "lock pg_class" ? > > I was thinking in first doing > SELECT array_agg(DISTINCT virtualtransaction) vxids > FROM pg_locks \gset > > and then in a DO block loop until > > SELECT DISTINCT virtualtransaction > FROM pg_locks > INTERSECT > SELECT (unnest(:'vxids'::text[])); > > returns empty; something along those lines. Isn't it the same though? I can't think how a transaction wouldn't be holding at least an access share on pg_class.
On Thu, Feb 8, 2018 at 12:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> Claudio Freire wrote: >>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> >> Waiting as you say would be akin to what the patch does by putting >>> >> vacuum on its own parallel group. >>> > >>> > I don't think it's the same. We don't need to wait until all the >>> > concurrent tests are done -- we only need to wait until the transactions >>> > that were current when the delete finished are done, which is very >>> > different since each test runs tons of small transactions rather than >>> > one single big transaction. >>> >>> Um... maybe "lock pg_class" ? >> >> I was thinking in first doing >> SELECT array_agg(DISTINCT virtualtransaction) vxids >> FROM pg_locks \gset >> >> and then in a DO block loop until >> >> SELECT DISTINCT virtualtransaction >> FROM pg_locks >> INTERSECT >> SELECT (unnest(:'vxids'::text[])); >> >> returns empty; something along those lines. > > Isn't it the same though? > > I can't think how a transaction wouldn't be holding at least an access > share on pg_class. Never mind, I just saw the error of my ways. I don't like looping, though, seems overly cumbersome. What's worse? maintaining that fragile weird loop that might break (by causing unexpected output), or a slight slowdown of the test suite? I don't know how long it might take on slow machines, but in my machine, which isn't a great machine, while the vacuum test isn't fast indeed, it's just a tiny fraction of what a simple "make check" takes. So it's not a huge slowdown in any case. I'll give it some thought, maybe there's a simpler way.
Hello, I looked this a bit closer. In upthread[1] Robert mentioned the exponentially increasing size of additional segments. >> Hmm, I had imagined making all of the segments the same size rather >> than having the size grow exponentially. The whole point of this is >> to save memory, and even in the worst case you don't end up with that >> many segments as long as you pick a reasonable base size (e.g. 64MB). > > Wastage is bound by a fraction of the total required RAM, that is, > it's proportional to the amount of required RAM, not the amount > allocated. So it should still be fine, and the exponential strategy > should improve lookup performance considerably. It seems that you are getting him wrong. (Anyway I'm not sure what you meant by the above. not-yet-allocated memory won't be a waste.) The conclusive number of dead tuples in a heap scan is undeteminable until the scan ends. If we had a new dead tuple required a, say 512MB new segment and the scan ends just after, the wastage will be almost the whole of the segment. On the other hand, I don't think the exponential strategy make things considerably better. bsearch iterations in lazy_tid_reaped() are distributed between segment search and tid search. Intuitively more or less the increased segment size just moves some iterations of the former to the latter. I made a calculation[2]. With maintemance_work_mem of 4096MB, the number of segments is 6 and expected number of bsearch iteration is about 20.8 for the exponential strategy. With 64MB fixed size segments, we will have 64 segments (that is not so many) and the expected iteration is 20.4. (I suppose the increase comes from the imbalanced size among segments.) Addition to that, as Robert mentioned, the possible maximum memory wastage of the exponential strategy is about 2GB and 64MB in fixed size strategy. Seeing these numbers, I don't tend to take the exponential strategy. [1] https://www.postgresql.org/message-id/CAGTBQpbZX5S4QrnB6YP-2Nk+A9bxbaVktzKwsGvMeov3MTgdiQ@mail.gmail.com [2] See attached perl script. I hope it is correct. regards, -- Kyotaro Horiguchi NTT Open Source Software Center

#! /usr/bin/perl
$maxmem = 1024 * 4;

#=====
print "exponential sized strategy\n";
$ss = 64;
$ts = 0;
$sumiteritem = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ss = $ss * 2;
    if ($ts + $ss > $maxmem) {
        $ss = $maxmem - $ts;
    }
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 291;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    $sumiteritem += log($ntupinseg) * $ntupinseg;  # weight by percentage in all tuples
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), iterseg(%d)=%f, iteritem(%d) = %f, expectediter=%f\n",
           $i, $ss, $ts, $ntups, $tsize,
           $i, log($i), $ntupinseg, log($ntupinseg),
           log($i) + $sumiteritem/$ntups);
}

print "\n\nfixed sized strategy\n";
$ss = 64;
$ts = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 300;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), interseg(%d)=%f, iteritem(%d) = %f, expectediter=%f\n",
           $i, $ss, $ts, $ntups, $tsize,
           $i, log($i), $ntupinseg, log($ntupinseg),
           log($i) + log($ntupinseg));
}
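For readers following the arithmetic, the script above is estimating an expected comparison count of roughly the following form (the notation S, n_i and N is introduced here for illustration: S segments, with segment i holding n_i of the N dead TIDs; the script uses natural logarithms, while binary-search comparisons correspond to base-2 logarithms, as the reply below points out):

    E[\mathrm{comparisons}] \;\approx\; \log_2 S \;+\; \sum_{i=1}^{S} \frac{n_i}{N}\,\log_2 n_i

For equal-sized segments this reduces to \log_2 S + \log_2 (N/S), which is why the fixed-size and exponential strategies end up so close in total comparisons: increasing a segment's size mostly shifts work from the segment search to the in-segment search.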
On Thu, Feb 8, 2018 at 2:44 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I looked this a bit closer. > > In upthread[1] Robert mentioned the exponentially increasing size > of additional segments. > >>> Hmm, I had imagined making all of the segments the same size rather >>> than having the size grow exponentially. The whole point of this is >>> to save memory, and even in the worst case you don't end up with that >>> many segments as long as you pick a reasonable base size (e.g. 64MB). >> >> Wastage is bound by a fraction of the total required RAM, that is, >> it's proportional to the amount of required RAM, not the amount >> allocated. So it should still be fine, and the exponential strategy >> should improve lookup performance considerably. > > It seems that you are getting him wrong. (Anyway I'm not sure > what you meant by the above. not-yet-allocated memory won't be a > waste.) The conclusive number of dead tuples in a heap scan is > undeteminable until the scan ends. If we had a new dead tuple > required a, say 512MB new segment and the scan ends just after, > the wastage will be almost the whole of the segment. And the segment size is bound by a fraction of total needed memory. When I said "allocated", I meant m_w_m. Wastage is not proportional to m_w_m. > On the other hand, I don't think the exponential strategy make > things considerably better. bsearch iterations in > lazy_tid_reaped() are distributed between segment search and tid > search. Intuitively more or less the increased segment size just > moves some iterations of the former to the latter. > > I made a calculation[2]. With maintemance_work_mem of 4096MB, the > number of segments is 6 and expected number of bsearch iteration > is about 20.8 for the exponential strategy. With 64MB fixed size > segments, we will have 64 segments (that is not so many) and the > expected iteration is 20.4. (I suppose the increase comes from > the imbalanced size among segments.) Addition to that, as Robert > mentioned, the possible maximum memory wastage of the exponential > strategy is about 2GB and 64MB in fixed size strategy. That calculation has a slight bug in that it should be log2, and that segment size is limited to 1GB at the top end. But in any case, the biggest issue is that it's ignoring the effect of cache locality. The way in which the exponential strategy helps, is by keeping the segment list small and comfortably fitting in fast cache memory, while also keeping wastage at a minimum for small lists. 64MB segments with 4G mwm would be about 2kb of segment list. It fits in L1, if there's nothing else contending for it, but it's already starting to get big, and it would be expected settings larger than 4G mwm could be used. I guess I could tune the starting/ending sizes a bit. Say, with an upper limit of 256M (instead of 1G), and after fixing the other issues, we get: exponential sized strategy ... #18 : segsize=64MB total=4096MB, (tuples = 715827882, min tsize=18.8GB), iterseg(18)=4.169925, iteritem(11184810) = 23.415037, expected iter=29.491213 fixed sized strategy ... #64 : segsize=64MB total=4096MB, (tuples = 715827882, min tsize=18.2GB), interseg(64)=6.000000, iteritem(11184810) = 23.415037, expected iter=29.415037 Almost identical, and we get all the benefits of cache locality with the exponential strategy. The fixed strategy might fit in the L1, but it's less likely the bigger the mwm is. 
The scaling factor could also be tuned I guess, but I'm wary of using anything other than a doubling strategy, since it might cause memory fragmentation.
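To make the two-level lookup being costed here concrete, here is a minimal, self-contained sketch of a binary search over the segment list followed by a binary search inside the chosen segment — the "iterseg" and "iteritem" halves of the estimates above. The type and function names are invented for illustration; this is not the patch's actual code. Keeping the segment list short and cache-resident is what the locality argument in the previous message is about.

/*
 * Sketch of a two-level lookup over a segmented sorted TID array, as
 * discussed for the dead-tuple "multi-array".  All names and types are
 * simplified stand-ins, not the patch's definitions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t tid_t;            /* stand-in for an encoded ItemPointer */

typedef struct
{
    tid_t      *tids;              /* sorted TIDs stored in this segment */
    size_t      ntids;
    tid_t       last_tid;          /* highest TID in the segment */
} tid_segment;

typedef struct
{
    tid_segment *segs;             /* segments, in ascending TID order */
    size_t       nsegs;
} tid_multiarray;

/* "iterseg" half: binary search for the segment that could hold the key. */
static tid_segment *
find_segment(const tid_multiarray *ma, tid_t key)
{
    size_t lo = 0, hi = ma->nsegs;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (ma->segs[mid].last_tid < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < ma->nsegs ? &ma->segs[lo] : NULL;
}

/* "iteritem" half: binary search for the key inside that segment. */
static bool
tid_reaped(const tid_multiarray *ma, tid_t key)
{
    tid_segment *seg = find_segment(ma, key);
    size_t lo, hi;

    if (seg == NULL)
        return false;

    lo = 0;
    hi = seg->ntids;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (seg->tids[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < seg->ntids && seg->tids[lo] == key;
}

int
main(void)
{
    tid_t          tids[] = {3, 7, 42, 100};
    tid_segment    seg = {tids, 4, 100};
    tid_multiarray ma = {&seg, 1};

    /* expect: 42 is reaped, 43 is not */
    return (tid_reaped(&ma, 42) && !tid_reaped(&ma, 43)) ? 0 : 1;
}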
Claudio Freire wrote: > I don't like looping, though, seems overly cumbersome. What's worse? > maintaining that fragile weird loop that might break (by causing > unexpected output), or a slight slowdown of the test suite? > > I don't know how long it might take on slow machines, but in my > machine, which isn't a great machine, while the vacuum test isn't fast > indeed, it's just a tiny fraction of what a simple "make check" takes. > So it's not a huge slowdown in any case. Well, what about a machine running tests under valgrind, or the weird cache-clobbering infuriatingly slow code? Or buildfarm members running on really slow hardware? These days, a few people have spent a lot of time trying to reduce the total test time, and it'd be bad to lose back the improvements for no good reason. I grant you that the looping I proposed is more complicated, but I don't see any reason why it would break. Another argument against the LOCK pg_class idea is that it causes an unnecessary contention point across the whole parallel test group -- with possible weird side effects. How about a deadlock? Other than the wait loop I proposed, I think we can make a couple of very simple improvements to this test case to avoid a slowdown: 1. the DELETE takes about 1/4th of the time and removes about the same number of rows as the one using the IN clause: delete from vactst where random() < 3.0 / 4; 2. Use a new temp table rather than vactst. Everything is then faster. 3. Figure out the minimum size for the table that triggers the behavior you want. Right now you use 400k tuples -- maybe 100k are sufficient? Don't know. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Claudio Freire wrote: > >> I don't like looping, though, seems overly cumbersome. What's worse? >> maintaining that fragile weird loop that might break (by causing >> unexpected output), or a slight slowdown of the test suite? >> >> I don't know how long it might take on slow machines, but in my >> machine, which isn't a great machine, while the vacuum test isn't fast >> indeed, it's just a tiny fraction of what a simple "make check" takes. >> So it's not a huge slowdown in any case. > > Well, what about a machine running tests under valgrind, or the weird > cache-clobbering infuriatingly slow code? Or buildfarm members running > on really slow hardware? These days, a few people have spent a lot of > time trying to reduce the total test time, and it'd be bad to lose back > the improvements for no good reason. It's not for no good reason. The old tests were woefully inadequate. During the process of developing the patch, I got seriously broken code that passed the tests nonetheless. The test as it was was very ineffective at actually detecting issues. This new test may be slow, but it's effective. That's a very good reason to make it slower, if you ask me. > I grant you that the looping I proposed is more complicated, but I don't > see any reason why it would break. > > Another argument against the LOCK pg_class idea is that it causes an > unnecessary contention point across the whole parallel test group -- > with possible weird side effects. How about a deadlock? The real issue with lock pg_class is that locks on pg_class are short-lived, so I'm not waiting for whole transactions. > Other than the wait loop I proposed, I think we can make a couple of > very simple improvements to this test case to avoid a slowdown: > > 1. the DELETE takes about 1/4th of the time and removes about the same > number of rows as the one using the IN clause: > delete from vactst where random() < 3.0 / 4; I did try this at first, but it causes random output, so the test breaks randomly. > 2. Use a new temp table rather than vactst. Everything is then faster. I might try that. > 3. Figure out the minimum size for the table that triggers the behavior > you want. Right now you use 400k tuples -- maybe 100k are sufficient? > Don't know. For that test, I need enough *dead* tuples to cause several passes. Even small mwm settings require tons of tuples for this. In fact, I'm thinking that number might be too low for its purpose, even. I'll re-check, but I doubt it's too high. If anything, it's too low.
Claudio Freire wrote: > On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > During the process of developing the patch, I got seriously broken > code that passed the tests nonetheless. The test as it was was very > ineffective at actually detecting issues. > > This new test may be slow, but it's effective. That's a very good > reason to make it slower, if you ask me. OK, I don't disagree with improving the test, but if we can make it fast *and* effective, that's better than slow and effective. > > Another argument against the LOCK pg_class idea is that it causes an > > unnecessary contention point across the whole parallel test group -- > > with possible weird side effects. How about a deadlock? > > The real issue with lock pg_class is that locks on pg_class are > short-lived, so I'm not waiting for whole transactions. Doh. > > Other than the wait loop I proposed, I think we can make a couple of > > very simple improvements to this test case to avoid a slowdown: > > > > 1. the DELETE takes about 1/4th of the time and removes about the same > > number of rows as the one using the IN clause: > > delete from vactst where random() < 3.0 / 4; > > I did try this at first, but it causes random output, so the test > breaks randomly. OK. Still, your query seqscans the table twice. Maybe it's possible to use a CTID scan to avoid that, but I'm not sure how. > > 3. Figure out the minimum size for the table that triggers the behavior > > you want. Right now you use 400k tuples -- maybe 100k are sufficient? > > Don't know. > > For that test, I need enough *dead* tuples to cause several passes. > Even small mwm settings require tons of tuples for this. In fact, I'm > thinking that number might be too low for its purpose, even. I'll > re-check, but I doubt it's too high. If anything, it's too low. OK. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 9, 2018 at 10:32 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Claudio Freire wrote: >> On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> During the process of developing the patch, I got seriously broken >> code that passed the tests nonetheless. The test as it was was very >> ineffective at actually detecting issues. >> >> This new test may be slow, but it's effective. That's a very good >> reason to make it slower, if you ask me. > > OK, I don't disagree with improving the test, but if we can make it fast > *and* effective, that's better than slow and effective. I'd love to have a test that uses multiple segments of dead tuples, but for that, it needs to use more than 64MB of mwm. That amounts to, basically, ~12M rows. Is there a "slow test suite" where such a test could be added that won't bother regular "make check"? That, or we turn the initial segment size into a GUC, but I don't think it's a useful GUC outside of the test suite. >> > 3. Figure out the minimum size for the table that triggers the behavior >> > you want. Right now you use 400k tuples -- maybe 100k are sufficient? >> > Don't know. >> >> For that test, I need enough *dead* tuples to cause several passes. >> Even small mwm settings require tons of tuples for this. In fact, I'm >> thinking that number might be too low for its purpose, even. I'll >> re-check, but I doubt it's too high. If anything, it's too low. > > OK. Turns out that it was a tad oversized. 300k tuples seems enough. Attached is a new patch version that: - Uses an unlogged table to make the large mwm test faster - Uses a wait_barrier helper that waits for concurrent transactions to finish before vacuuming tables, to make sure deleted tuples actually are vacuumable - Tweaks the size of the large mwm test to be as small as possible - Optimizes the delete to avoid expensive operations yet attain the same end result
Attachment
On Fri, Aug 18, 2017 at 8:39 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Indeed they do, and that's what motivated this patch. But I'd need >> TB-sized tables to set up something like that. I don't have the >> hardware or time available to do that (vacuum on bloated TB-sized >> tables can take days in my experience). Scale 4000 is as big as I can >> get without running out of space for the tests in my test hardware. >> >> If anybody else has the ability, I'd be thankful if they did test it >> under those conditions, but I cannot. I think Anastasia's test is >> closer to such a test, that's probably why it shows a bigger >> improvement in total elapsed time. >> >> Our production database could possibly be used, but it can take about >> a week to clone it, upgrade it (it's 9.5 currently), and run the >> relevant vacuum. > > It looks like I won't be able to do that test with a production > snapshot anytime soon. > > Getting approval for the budget required to do that looks like it's > going to take far longer than I thought. I finally had a chance to test the patch in a production snapshot. Actually, I tried to take out 2 birds with one stone, and I'm also testing the FSM vacuum patch. It shouldn't significantly alter the numbers anyway. So, while the whole-db throttled vacuum (as is run in production) is still ongoing, an interesting case already popped up. TL;DR, without the patch, this particular table took 16 1/2 hours more or less, to vacuum 313M dead tuples. With the patch, it took 6:10h to vacuum 323M dead tuples. That's quite a speedup. It even used significantly less CPU time as well. Since vacuum here is throttled (with cost-based delays), this also means it generated less I/O. We have more extreme cases sometimes, so if I see something interesting in what remains of the test, I'll post those results as well. The raw data: Patched INFO: vacuuming "public.aggregated_tracks_hourly_full" INFO: scanned index "aggregated_tracks_hourly_full_pkey_null" to remove 323778164 row versions DETAIL: CPU: user: 111.57 s, system: 31.28 s, elapsed: 2693.67 s INFO: scanned index "ix_aggregated_tracks_hourly_full_action_null" to remove 323778164 row versions DETAIL: CPU: user: 281.89 s, system: 36.32 s, elapsed: 2915.94 s INFO: scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove 323778164 row versions DETAIL: CPU: user: 313.35 s, system: 79.22 s, elapsed: 6400.87 s INFO: "aggregated_tracks_hourly_full": removed 323778164 row versions in 7070739 pages DETAIL: CPU: user: 583.48 s, system: 69.77 s, elapsed: 8048.00 s INFO: index "aggregated_tracks_hourly_full_pkey_null" now contains 720807609 row versions in 10529903 pages DETAIL: 43184641 index row versions were removed. 5288916 index pages have been deleted, 4696227 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.03 s. INFO: index "ix_aggregated_tracks_hourly_full_action_null" now contains 720807609 row versions in 7635161 pages DETAIL: 202678522 index row versions were removed. 4432789 index pages have been deleted, 3727966 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s. INFO: index "ix_aggregated_tracks_hourly_full_nunq" now contains 720807609 row versions in 15526885 pages DETAIL: 202678522 index row versions were removed. 9052488 index pages have been deleted, 7390279 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s. 
INFO: "aggregated_tracks_hourly_full": found 41131260 removable, 209520861 nonremovable row versions in 7549244 out of 22391603 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 245834316 There were 260553451 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 1329.64 s, system: 244.22 s, elapsed: 22222.14 s. Vanilla 9.5 (ie: what's in production right now, should be similar to master): INFO: vacuuming "public.aggregated_tracks_hourly_full" INFO: scanned index "aggregated_tracks_hourly_full_pkey_null" to remove 178956729 row versions DETAIL: CPU 65.51s/253.67u sec elapsed 3490.13 sec. INFO: scanned index "ix_aggregated_tracks_hourly_full_action_null" to remove 178956729 row versions DETAIL: CPU 63.26s/238.08u sec elapsed 3483.32 sec. INFO: scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove 178956729 row versions DETAIL: CPU 340.15s/445.52u sec elapsed 15898.48 sec. INFO: "aggregated_tracks_hourly_full": removed 178956729 row versions in 3121122 pages DETAIL: CPU 168.24s/159.20u sec elapsed 5678.51 sec. INFO: scanned index "aggregated_tracks_hourly_full_pkey_null" to remove 134424729 row versions DETAIL: CPU 50.66s/265.19u sec elapsed 3977.15 sec. INFO: scanned index "ix_aggregated_tracks_hourly_full_action_null" to remove 134424729 row versions DETAIL: CPU 99.68s/326.44u sec elapsed 6580.22 sec. INFO: scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove 134424729 row versions DETAIL: CPU 146.96s/358.86u sec elapsed 10464.69 sec. INFO: "aggregated_tracks_hourly_full": removed 134424729 row versions in 2072649 pages DETAIL: CPU 109.07s/37.12u sec elapsed 3601.39 sec. INFO: index "aggregated_tracks_hourly_full_pkey_null" now contains 870543969 row versions in 10529903 pages DETAIL: 134424771 index row versions were removed. 4358027 index pages have been deleted, 3662385 are currently reusable. CPU 0.02s/0.00u sec elapsed 2.42 sec. INFO: index "ix_aggregated_tracks_hourly_full_action_null" now contains 870543969 row versions in 7635161 pages DETAIL: 134424771 index row versions were removed. 3908583 index pages have been deleted, 3445049 are currently reusable. CPU 0.02s/0.00u sec elapsed 0.08 sec. INFO: index "ix_aggregated_tracks_hourly_full_nunq" now contains 870543969 row versions in 15526885 pages DETAIL: 218955943 index row versions were removed. 7710441 index pages have been deleted, 5928522 are currently reusable. CPU 0.02s/0.01u sec elapsed 0.19 sec. INFO: "aggregated_tracks_hourly_full": found 134159696 removable, 90271560 nonremovable row versions in 6113375 out of 22391603 pages DETAIL: 287 dead row versions cannot be removed yet. There were 126680434 unused item pointers. Skipped 0 pages due to buffer pins. 0 pages are entirely empty. CPU 1191.42s/2223.19u sec elapsed 59885.50 sec.
On Fri, Feb 9, 2018 at 1:05 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Turns out that it was a tad oversized. 300k tuples seems enough. > > Attached is a new patch version that: > > - Uses an unlogged table to make the large mwm test faster > - Uses a wait_barrier helper that waits for concurrent transactions > to finish before vacuuming tables, to make sure deleted tuples > actually are vacuumable > - Tweaks the size of the large mwm test to be as small as possible > - Optimizes the delete to avoid expensive operations yet attain > the same end result Attached rebased versions of the patches (they weren't applying to current master)
Attachment
Hello everyone, I would like to let you know that unfortunately these patches don't apply anymore. Also personally I'm a bit confused by the last message that has 0001- and 0003- patches attached but not the 0002- one.
I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed they didn't apply anymore in cputube, and they still seem to apply. Patch number 2 was committed a long while ago, that's why it's missing. It was a simple patch, it landed rewritten as commit 7e26e02eec90370dd222f35f00042f8188488ac4
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed they didn't apply anymore in cputube, and they still seem to apply. Sorry, that is false. They appear green in cputube, so I was confident they did apply, but I just double-checked on a recent pull and they don't. I'll rebase them shortly.
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed they didn't apply anymore in cputube, and they still seem to apply. > > Sorry, that is false. > > They appear green in cputube, so I was confident they did apply, but I > just double-checked on a recent pull and they don't. I'll rebase them > shortly. Ok, rebased patches attached
Attachment
On 03/04/18 17:20, Claudio Freire wrote: > Ok, rebased patches attached Thanks! I took a look at this. First, now that the data structure is more complicated, I think it's time to abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support the following operations: * Add TIDs, in order (in 1st phase of vacuum) * Random lookup, by TID (when scanning indexes) * Iterate through all TIDs, in order (2nd pass over heap) Let's add a new source file to hold the code for the tid map data structure, with functions corresponding those operations. I took a stab at doing that, and I think it makes vacuumlazy.c nicer. Secondly, I'm not a big fan of the chosen data structure. I think the only reason that the segmented "multi-array" was chosen is that each "segment" works is similar to the simple array that we used to have. After putting it behind the abstraction, it seems rather ad hoc. There are many standard textbook data structures that we could use instead, and would be easier to understand and reason about, I think. So, I came up with the attached patch. I used a B-tree as the data structure. Not sure it's the best one, I'm all ears for suggestions and bikeshedding on alternatives, but I'm pretty happy with that. I would expect it to be pretty close to the simple array with binary search in performance characteristics. It'd be pretty straightforward to optimize further, and e.g. use a bitmap of OffsetNumbers or some other more dense data structure in the B-tree leaf pages, but I resisted doing that as part of this patch. I haven't done any performance testing of this (and not much testing in general), but at least the abstraction seems better this way. Performance testing would be good, too. In particular, I'd like to know how this might affect the performance of lazy_tid_reaped(). That's a hot spot when vacuuming indexes, so we don't want to add any cycles there. Was there any ready-made test kits on that in this thread? I didn't see any at a quick glance, but it's a long thread.. - Heikki
Attachment
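For readers skimming the thread, the abstraction described above boils down to an interface of roughly the following shape. This is only a sketch with hypothetical names; the actual functions and signatures live in the attached patch and may well differ.

/*
 * Sketch of a possible tid map interface; names and signatures here are
 * illustrative, not the ones in the attached patch.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNum;              /* stand-in for BlockNumber */
typedef uint16_t OffsetNum;             /* stand-in for OffsetNumber */

typedef struct TidMap TidMap;           /* opaque map over dead TIDs */
typedef struct TidMapIter TidMapIter;   /* opaque iteration state */

/* Sized for a memory budget such as maintenance_work_mem. */
extern TidMap *tidmap_create(uint64_t max_bytes);
extern void tidmap_free(TidMap *map);

/* 1st heap pass: TIDs are appended in ascending (block, offset) order. */
extern void tidmap_add_tid(TidMap *map, BlockNum blkno, OffsetNum offnum);

/* Index scans: membership test, the lazy_tid_reaped() hot path. */
extern bool tidmap_lookup(const TidMap *map, BlockNum blkno, OffsetNum offnum);

/* 2nd heap pass: walk the dead TIDs in order, one block at a time. */
extern TidMapIter *tidmap_begin_iterate(TidMap *map);
extern bool tidmap_iterate_block(TidMapIter *iter, BlockNum *blkno,
                                 OffsetNum **offnums, int *noffnums);
extern void tidmap_end_iterate(TidMapIter *iter);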
On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 03/04/18 17:20, Claudio Freire wrote: >> >> Ok, rebased patches attached > > > Thanks! I took a look at this. > > First, now that the data structure is more complicated, I think it's time to > abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support > the following operations: > > * Add TIDs, in order (in 1st phase of vacuum) > * Random lookup, by TID (when scanning indexes) > * Iterate through all TIDs, in order (2nd pass over heap) > > Let's add a new source file to hold the code for the tid map data structure, > with functions corresponding those operations. > > I took a stab at doing that, and I think it makes vacuumlazy.c nicer. About the refactoring to split this into their own set of files and abstract away the underlying structure, I can totally get behind that. The iteration interface, however, seems quite specific for the use case of vacuumlazy, so it's not really a good abstraction. It also copies stuff a lot, so it's quite heavyweight. I'd suggest trying to go for a lighter weight interface with less overhead that is more general at the same time. If it was C++, I'd say build an iterator class. C would do it probably with macros, so you can have a macro to get to the current element, another to advance to the next element, and another to check whether you've reached the end. I can do that if we agree on the points below: > Secondly, I'm not a big fan of the chosen data structure. I think the only > reason that the segmented "multi-array" was chosen is that each "segment" > works is similar to the simple array that we used to have. After putting it > behind the abstraction, it seems rather ad hoc. There are many standard > textbook data structures that we could use instead, and would be easier to > understand and reason about, I think. > > So, I came up with the attached patch. I used a B-tree as the data > structure. Not sure it's the best one, I'm all ears for suggestions and > bikeshedding on alternatives, but I'm pretty happy with that. I would expect > it to be pretty close to the simple array with binary search in performance > characteristics. It'd be pretty straightforward to optimize further, and > e.g. use a bitmap of OffsetNumbers or some other more dense data structure > in the B-tree leaf pages, but I resisted doing that as part of this patch. About the B-tree, however, I don't think a B-tree is a good idea. Trees' main benefit is that they can be inserted to efficiently. When all your data is loaded sequentially, in-order, in-memory and immutable; the tree is pointless, more costly to build, and harder to maintain - in terms of code complexity. In this use case, the only benefit of B-trees would be that they're optimized for disk access. If we planned to store this on-disk, perhaps I'd grant you that. But we don't plan to do that, and it's not even clear doing it would be efficient enough for the intended use. On the other side, using B-trees incurs memory overhead due to the need for internal nodes, can fragment memory because internal nodes aren't the same size as leaf nodes, is easier to get wrong and introduce bugs... I don't see a gain. If you propose its use, at least benchmark it to show some gain. So I don't think B-tree is a good idea, the sorted array already is good enough, and if not, it's at least close to the earlier implementation and less likely to introduce bugs. 
Furthermore, among the 200-ish messages this thread has accumulated, better ideas have been proposed, better because they do use less memory and are faster (like using bitmaps when possible), but if we can't push a simple refactoring first, there's no chance a bigger rewrite will fare better. Remember, in this use case, using less memory far outweighs any other consideration. Less memory directly translates to fewer iterations over the indexes, because more can be crammed into m_w_m, which is a huge time saving. Far more than any micro-optimization. About 2 years ago, I chose to try to push this simple algorithm first, then try to improve on it with better data structures. Nobody complained at the time (I think, IIRC), and I don't think it's fair to go and revisit that now. It just delays getting a solution for this issue in pursuit of "the perfect implementation" that might never arrive. Or even if it does, there's nothing stopping us from pushing another patch in the future with that better implementation if we wish. Let's get something simple and proven first. > I haven't done any performance testing of this (and not much testing in general), but at least the abstraction seems better this way. Performance testing would be good, too. In particular, I'd like to know how this might affect the performance of lazy_tid_reaped(). That's a hot spot when vacuuming indexes, so we don't want to add any cycles there. Was there any ready-made test kits on that in this thread? I didn't see any at a quick glance, but it's a long thread.. If you dig through old messages in the thread, you'll find the scripts I used for benchmarking this. I'm attaching again one version of them (I've been modifying it to suit my purposes at each review round), you'll probably want to tweak it to build test cases good for your purpose here.
Attachment
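For what it's worth, the macro-style iterator suggested in the message above could look something like the following — a toy, self-contained sketch over a plain sorted array, with names invented here; a real version would walk the segmented structure and avoid copying anything.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t tid_t;         /* stand-in for an encoded ItemPointer */

/* Toy iterator state over a plain array of TIDs. */
typedef struct
{
    const tid_t *tids;
    int          ntids;
    int          pos;
} tid_iter;

/* The three operations suggested above: init, done-check, current, advance. */
#define TID_ITER_INIT(it, arr, n)  ((it).tids = (arr), (it).ntids = (n), (it).pos = 0)
#define TID_ITER_DONE(it)          ((it).pos >= (it).ntids)
#define TID_ITER_CURRENT(it)       ((it).tids[(it).pos])
#define TID_ITER_ADVANCE(it)       ((it).pos++)

int
main(void)
{
    tid_t    dead[] = {1, 5, 9, 12};
    tid_iter it;

    /* Walk the dead TIDs in order without copying them anywhere. */
    for (TID_ITER_INIT(it, dead, 4); !TID_ITER_DONE(it); TID_ITER_ADVANCE(it))
        printf("dead tid: %llu\n", (unsigned long long) TID_ITER_CURRENT(it));

    return 0;
}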
On 06/04/18 01:59, Claudio Freire wrote: > The iteration interface, however, seems quite specific for the use > case of vacuumlazy, so it's not really a good abstraction. Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time. > It also copies stuff a lot, so it's quite heavyweight. I'd suggest > trying to go for a lighter weight interface with less overhead that > is more general at the same time. Note that there was similar copying, to construct an array of OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the net amount of copying is the same. I'm envisioning that this data structure will sooner or later be optimized further, so that when you have a lot of TIDs pointing to the same block, we would pack them more tightly, storing the block number just once, with an array of offset numbers. This interface that returns an array of offset numbers matches that future well, as the iterator could just return a pointer to the array of offset numbers, with no copying. (If we end up doing something even more dense, like a bitmap, then it doesn't help, but that's ok too.) > About the B-tree, however, I don't think a B-tree is a good idea. > > Trees' main benefit is that they can be inserted to efficiently. When > all your data is loaded sequentially, in-order, in-memory and > immutable; the tree is pointless, more costly to build, and harder to > maintain - in terms of code complexity. > > In this use case, the only benefit of B-trees would be that they're > optimized for disk access. Those are not the reasons for which I'd prefer a B-tree. A B-tree has good cache locality, and when you don't need to worry about random insertions, page splits, deletions etc., it's also very simple to implement. This patch is not much longer than the segmented multi-array. > On the other side, using B-trees incurs memory overhead due to the > need for internal nodes, can fragment memory because internal nodes > aren't the same size as leaf nodes, is easier to get wrong and > introduce bugs... I don't see a gain. The memory overhead incurred by the internal nodes is quite minimal, and can be adjusted by changing the node sizes. After some experimentation, I settled on 2048 items per leaf node, and 64 items per internal node. With those values, the overhead caused by the internal nodes is minimal, below 0.5%. That seems fine, but we could increase the node sizes to bring it further down, if we'd prefer that tradeoff. I don't understand what memory fragmentation problems you're worried about. The tree grows one node at a time, as new TIDs are added, until it's all released at the end. I don't see how the size of internal vs leaf nodes matters. > If you propose its use, at least benchmark it to show some gain. Sure. I used the attached script to test this. It's inspired by the test script you posted. It creates a pgbench database with scale factor 100, deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more heavily, the test script creates a number of extra indexes. 
Half of them are on the primary key, just to get more repetitions without having to re-initialize in between, and the rest are like this: create index random_1 on pgbench_accounts((hashint4(aid))) to stress lazy_vacuum_tid_reaped() with a random access pattern, rather than the sequential one that you get with the primary key index. I ran the test script on my laptop, with unpatched master, with your latest multi-array patch, and with the attached version of the b-tree patch. The results are quite noisy, unfortunately, so I wouldn't draw very strong conclusions from it, but it seems that the performance of all three versions is roughly the same. I looked in particular at the CPU time spent in the index vacuums, as reported by VACUUM VERBOSE. > Furthermore, among the 200-ish messages this thread has accumulated, > better ideas have been proposed, better because they do use less > memory and are faster (like using bitmaps when possible), but if we > can't push a simple refactoring first, there's no chance a bigger > rewrite will fare better. Remember, in this use case, using less > memory far outweights any other consideration. Less memory directly > translates to less iterations over the indexes, because more can be > crammed into m_w_m, which is a huge time saving. Far more than any > micro-optimization. > > About 2 years ago, I chose to try to push this simple algorithm first, > then try to improve on it with better data structures. Nobody > complained at the time (I think, IIRC), and I don't think it fair to > go and revisit that now. It just delays getting a solution for this > issue for the persuit of "the perfect implementaiton" that might never > arrive. Or even if it doesn, there's nothing stopping us from pushing > another patch in the future with that better implementation if we > wish. Lets get something simple and proven first. True all that. My point is that the multi-segmented array isn't all that simple and proven, compared to an also straightforward B-tree. It's pretty similar to a B-tree, actually, except that it has exactly two levels, and the node (= segment) sizes grow exponentially. I'd rather go with a true B-tree, than something homegrown that resembles a B-tree, but not quite. > I'm attaching again one version of them (I've been modifying it to > suit my purposes at each review round), you'll probably want to tweak > it to build test cases good for your purpose here. Thanks! Attached is a new version of my b-tree version. Compared to yesterday's version, I fixed a bunch of bugs that turned up in testing. Looking at the changes to the regression test in this, I don't quite understand what it's all about. What are the "wait_barriers" for? If I understand correctly, they're added so that the VACUUMs can remove the tuples that are deleted in the test. But why are they needed now? Was that an orthogonal change we should've done anyway? Rather than add those wait_barriers, should we stop running the 'vacuum' test in parallel with the other tests? Or maybe it's a good thing to run it in parallel, to test some other things? What are the new tests supposed to cover? The test comment says "large mwm vacuum runs", and it sets maintenance_work_mem to 1 MB, which isn't very large - Heikki
Attachment
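As a rough illustration of the denser per-block layout mentioned above (the block number stored once, followed by an array of offset numbers) — a self-contained toy, not the node format of the attached patch — grouping a sorted TID stream block by block looks like this, which is also the shape a block-at-a-time iterator would hand back:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNum;      /* stand-in for BlockNumber */
typedef uint16_t OffsetNum;     /* stand-in for OffsetNumber */

typedef struct
{
    BlockNum    blkno;          /* heap block, stored once */
    int         noffsets;       /* number of dead offsets on it */
    OffsetNum   offsets[32];    /* toy fixed-size bound, enough for this demo */
} BlockItem;

/*
 * Group a sorted (block, offset) TID stream into per-block items, the way
 * a block-at-a-time iterator could return them without per-call copying.
 */
static int
pack_blocks(const BlockNum *blks, const OffsetNum *offs, int ntids,
            BlockItem *out)
{
    int nitems = 0;

    for (int i = 0; i < ntids; i++)
    {
        if (nitems == 0 || out[nitems - 1].blkno != blks[i])
        {
            out[nitems].blkno = blks[i];
            out[nitems].noffsets = 0;
            nitems++;
        }
        out[nitems - 1].offsets[out[nitems - 1].noffsets++] = offs[i];
    }
    return nitems;
}

int
main(void)
{
    BlockNum  blks[] = {10, 10, 10, 42, 42};
    OffsetNum offs[] = {1, 4, 7, 2, 3};
    BlockItem items[5];
    int       n = pack_blocks(blks, offs, 5, items);

    for (int i = 0; i < n; i++)
        printf("block %u has %d dead offsets\n",
               (unsigned) items[i].blkno, items[i].noffsets);
    return 0;
}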
On 06/04/18 16:39, Heikki Linnakangas wrote: > Sure. I used the attached script to test this. Sorry, I attached the wrong script. Here is the correct one that I used. Here are also the results I got from running it - Heikki
Attachment
Heikki Linnakangas wrote: > On 06/04/18 01:59, Claudio Freire wrote: > > The iteration interface, however, seems quite specific for the use > > case of vacuumlazy, so it's not really a good abstraction. > > Can you elaborate? It does return the items one block at a time. Is that > what you mean by being specific for vacuumlazy? I guess that's a bit > special, but if you imagine some other users for this abstraction, it's > probably not that unusual. For example, if we started using it in bitmap > heap scans, a bitmap heap scan would also want to get the TIDs one block > number at a time. FWIW I liked the idea of having this abstraction possibly do other things -- for instance to vacuum brin indexes you'd like to mark index tuples as "containing tuples that were removed" and eventually re-summarize the range. With the current interface we cannot do that, because vacuum expects brin vacuuming to ask for each heap tuple "is this tid dead?" and of course we don't have a list of tids to ask for. So if we can ask instead "how many dead tuples does this block contain?" brin vacuuming will be much happier. > Looking at the changes to the regression test in this, I don't quite > understand what it's all about. What are the "wait_barriers" for? If I > understand correctly, they're added so that the VACUUMs can remove the > tuples that are deleted in the test. But why are they needed now? Was that > an orthogonal change we should've done anyway? > > Rather than add those wait_barriers, should we stop running the 'vacuum' > test in parallel with the other tests? Or maybe it's a good thing to run it > in parallel, to test some other things? 20180207235226.zygu4r3yv3yfcnmc@alvherre.pgsql -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
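The "how many dead tuples does this block contain?" question maps naturally onto a per-block count over the same sorted TID array. The sketch below uses hypothetical helper names, not an interface taken from either patch; it only shows that the query amounts to two binary searches over block boundaries.

/* Sketch: answering "how many dead tuples does this block contain?"
 * with two binary searches over a sorted (block, offset) array.
 * Hypothetical helper names; not an interface from either patch. */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct
{
    BlockNumber  blkno;
    OffsetNumber offnum;
} DeadTid;

/* Index of the first entry whose block number is >= blkno. */
static size_t
first_index_for_block(const DeadTid *tids, size_t ntids, BlockNumber blkno)
{
    size_t lo = 0, hi = ntids;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (tids[mid].blkno < blkno)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Count of dead TIDs in one heap block (ignoring the BlockNumber
 * wraparound corner case at UINT32_MAX for brevity). */
size_t
dead_tuples_in_block(const DeadTid *tids, size_t ntids, BlockNumber blkno)
{
    return first_index_for_block(tids, ntids, blkno + 1) -
           first_index_for_block(tids, ntids, blkno);
}

Whether BRIN vacuuming could actually be driven this way would still depend on the bulkdelete interface, as noted downthread.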
On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 06/04/18 01:59, Claudio Freire wrote: >> >> The iteration interface, however, seems quite specific for the use >> case of vacuumlazy, so it's not really a good abstraction. > > > Can you elaborate? It does return the items one block at a time. Is that > what you mean by being specific for vacuumlazy? I guess that's a bit > special, but if you imagine some other users for this abstraction, it's > probably not that unusual. For example, if we started using it in bitmap > heap scans, a bitmap heap scan would also want to get the TIDs one block > number at a time. But you're also tying the caller to the format of the buffer holding those TIDs, for instance. Why would you, when you can have an interface that just iterates TIDs and let the caller store them if/however they want? I do believe a pure iterator interface is a better interface. >> It also copies stuff a lot, so it's quite heavyweight. I'd suggest >> trying to go for a lighter weight interface with less overhead that >> is more general at the same time. > > > Note that there was similar copying, to construct an array of OffsetNumbers, > happening in lazy_vacuum_page() before this patch. So the net amount of > copying is the same. > > I'm envisioning that this data structure will sooner or later be optimized > further, so that when you have a lot of TIDs pointing to the same block, we > would pack them more tightly, storing the block number just once, with an > array of offset numbers. This interface that returns an array of offset > numbers matches that future well, as the iterator could just return a > pointer to the array of offset numbers, with no copying. (If we end up doing > something even more dense, like a bitmap, then it doesn't help, but that's > ok too.) But that's the thing. It's a specialized interface for a future we're not certain. It's premature. A generic interface does not preclude the possibility of implementing those in the future, it allows you *not* to if there's no gain. Doing it now, it forces you to. >> About the B-tree, however, I don't think a B-tree is a good idea. >> >> Trees' main benefit is that they can be inserted to efficiently. When >> all your data is loaded sequentially, in-order, in-memory and >> immutable; the tree is pointless, more costly to build, and harder to >> maintain - in terms of code complexity. >> >> In this use case, the only benefit of B-trees would be that they're >> optimized for disk access. > > > Those are not the reasons for which I'd prefer a B-tree. A B-tree has good > cache locality, and when you don't need to worry about random insertions, > page splits, deletions etc., it's also very simple to implement. This patch > is not much longer than the segmented multi-array. But it *is* complex and less tested. Testing it and making it mature will take time. Why do that if doing bitmaps is a better path? >> On the other side, using B-trees incurs memory overhead due to the >> need for internal nodes, can fragment memory because internal nodes >> aren't the same size as leaf nodes, is easier to get wrong and >> introduce bugs... I don't see a gain. > > > The memory overhead incurred by the internal nodes is quite minimal, and can > be adjusted by changing the node sizes. After some experimentation, I > settled on 2048 items per leaf node, and 64 items per internal node. With > those values, the overhead caused by the internal nodes is minimal, below > 0.5%. 
That seems fine, but we could increase the node sizes to bring it > further down, if we'd prefer that tradeoff. > > I don't understand what memory fragmentation problems you're worried about. > The tree grows one node at a time, as new TIDs are added, until it's all > released at the end. I don't see how the size of internal vs leaf nodes > matters. Large vacuums do several passes, so they'll create one tid map for each pass. Each pass will allocate m_w_m-worth of pages, and then deallocate them. B-tree page allocations are smaller than malloc's mmap threshold, so freeing them won't return memory to the operating system. Furthermore, if other allocations get interleaved, objects could be left lying at random points in the heap, preventing efficient reuse of the heap for the next round. In essence, internal fragmentation. This may be exacerbated by AllocSet's own fragmentation and double-accounting. From what I can tell, inner nodes will use pools, and leaf nodes will be allocated as dedicated chunks (mallocd). Segments in my implementation are big, exactly because of that. The aim is to have large buffers that malloc will mmap, so they get returned to the os (unmapped) when freed quickly, and with little overhead. This fragmentation may cause actual pain in autovacuum, since autovacuum workers are relatively long-lived. >> If you propose its use, at least benchmark it to show some gain. > > > Sure. I used the attached script to test this. It's inspired by the test > script you posted. It creates a pgbench database with scale factor 100, > deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more > heavily, the test script creates a number of extra indexes. Half of them are > on the primary key, just to get more repetitions without having to > re-initialize in between, and the rest are like this: > > create index random_1 on pgbench_accounts((hashint4(aid))) > > to stress lazy_vacuum_tid_reaped() with a random access pattern, rather than > the sequential one that you get with the primary key index. > > I ran the test script on my laptop, with unpatched master, with your latest > multi-array patch, and with the attached version of the b-tree patch. The > results are quite noisy, unfortunately, so I wouldn't draw very strong > conclusions from it, but it seems that the performance of all three versions > is roughly the same. I looked in particular at the CPU time spent in the > index vacuums, as reported by VACUUM VERBOSE. Scale factor 100 is hardly enough to stress large m_w_m vacuum. I found scales of 1k-4k (or bigger) are best, with large m_w_m settings (4G for example) to get to really see how the data structure performs. >> Furthermore, among the 200-ish messages this thread has accumulated, >> better ideas have been proposed, better because they do use less >> memory and are faster (like using bitmaps when possible), but if we >> can't push a simple refactoring first, there's no chance a bigger >> rewrite will fare better. Remember, in this use case, using less >> memory far outweights any other consideration. Less memory directly >> translates to less iterations over the indexes, because more can be >> crammed into m_w_m, which is a huge time saving. Far more than any >> micro-optimization. >> >> About 2 years ago, I chose to try to push this simple algorithm first, >> then try to improve on it with better data structures. Nobody >> complained at the time (I think, IIRC), and I don't think it fair to >> go and revisit that now. 
It just delays getting a solution for this >> issue for the persuit of "the perfect implementaiton" that might never >> arrive. Or even if it doesn, there's nothing stopping us from pushing >> another patch in the future with that better implementation if we >> wish. Lets get something simple and proven first. > > > True all that. My point is that the multi-segmented array isn't all that > simple and proven, compared to an also straightforward B-tree. It's pretty > similar to a B-tree, actually, except that it has exactly two levels, and > the node (= segment) sizes grow exponentially. I'd rather go with a true > B-tree, than something homegrown that resembles a B-tree, but not quite. I disagree. Being similar to what vacuum is already doing, we can be confident the approach is sound, at least as sound as current vacuum. It shares a lot with the current implementation, which is known to be good. The multi-segmented array itself has received a lot of testing during the ~1.5y it has spent in the making, as well. I've been running extensive benchmarking and tests each time I changed something, and I've even ran deep tests on an actual production snapshot. A complete db-wide vacuum of a heavily bloated ~12TB production database, showing significant speedups not only because it does fewer index scans, but also because it uses less CPU time to do so, with consistency checking after the fact, to check for bugs. Of course I can't guarantee it's bug-free, but it *is* decently tested. I'm pretty sure the B-tree implementation hasn't reached that level of testing yet. It might in the future, but it won't happen overnight. Your B-tree patch is also homegrown. You're not reusing well tested btree code, you're coding a B-tree from scratch, so it's as suspect as any new code. I agree the multi-segment algorithm is quite similar to a shallow b-tree, but I'm not convinced a b-tree is what we must aspire to have. In fact, if you used large pages for the B-tree, you don't need more than 2 levels (there's a 12GB limit ATM on the size of the tid map), so the multi-segment approach and the b-tree approach are essentially the same. Except the multi-segment code got more testing. In short, there are other more enticing alternatives to try out first. I'm not enthused by the idea of having to bench and test yet another sorted set implementation before moving forward. On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Heikki Linnakangas wrote: >> On 06/04/18 01:59, Claudio Freire wrote: >> > The iteration interface, however, seems quite specific for the use >> > case of vacuumlazy, so it's not really a good abstraction. >> >> Can you elaborate? It does return the items one block at a time. Is that >> what you mean by being specific for vacuumlazy? I guess that's a bit >> special, but if you imagine some other users for this abstraction, it's >> probably not that unusual. For example, if we started using it in bitmap >> heap scans, a bitmap heap scan would also want to get the TIDs one block >> number at a time. > > FWIW I liked the idea of having this abstraction possibly do other > things -- for instance to vacuum brin indexes you'd like to mark index > tuples as "containing tuples that were removed" and eventually > re-summarize the range. With the current interface we cannot do that, > because vacuum expects brin vacuuming to ask for each heap tuple "is > this tid dead?" and of course we don't have a list of tids to ask for. 
> So if we can ask instead "how many dead tuples does this block contain?" > brin vacuuming will be much happier. I don't think either patch gives you that. The bulkdelete interface is part of the indexam and unlikely to change in this patch.
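For contrast with the block-at-a-time sketch earlier, here is roughly what the "pure iterator" interface argued for above could look like: one TID per call, with the storage format hidden behind the iterator. Again, all names are illustrative assumptions, not code from either patch.

/* Sketch: a "pure" per-TID iterator that hides the storage format. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct
{
    BlockNumber  blkno;
    OffsetNumber offnum;
} DeadTid;

typedef struct
{
    const DeadTid *tids;    /* storage detail hidden from callers */
    size_t         ntids;
    size_t         pos;
} TidIter;

void
tid_iter_init(TidIter *it, const DeadTid *tids, size_t ntids)
{
    it->tids = tids;
    it->ntids = ntids;
    it->pos = 0;
}

/* Yield one TID per call, in sorted order; no batching imposed. */
bool
tid_iter_next(TidIter *it, DeadTid *out)
{
    if (it->pos >= it->ntids)
        return false;
    *out = it->tids[it->pos++];
    return true;
}

A caller that wants block-at-a-time batches can still build them by accumulating offsets until the block number changes, while a caller that wants a different layout (say, a bitmap) is not forced through an intermediate offset array.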
Claudio Freire wrote: > On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > FWIW I liked the idea of having this abstraction possibly do other > > things -- for instance to vacuum brin indexes you'd like to mark index > > tuples as "containing tuples that were removed" and eventually > > re-summarize the range. With the current interface we cannot do that, > > because vacuum expects brin vacuuming to ask for each heap tuple "is > > this tid dead?" and of course we don't have a list of tids to ask for. > > So if we can ask instead "how many dead tuples does this block contain?" > > brin vacuuming will be much happier. > > I don't think either patch gives you that. > > The bulkdelete interface is part of the indexam and unlikely to change > in this patch. I'm sure you're correct. I was just saying that with the abstract interface it is easier to implement what I suggest as a follow-on patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> On 06/04/18 01:59, Claudio Freire wrote: >>> The iteration interface, however, seems quite specific for the use >>> case of vacuumlazy, so it's not really a good abstraction. >> >> Can you elaborate? It does return the items one block at a time. Is that >> what you mean by being specific for vacuumlazy? I guess that's a bit >> special, but if you imagine some other users for this abstraction, it's >> probably not that unusual. For example, if we started using it in bitmap >> heap scans, a bitmap heap scan would also want to get the TIDs one block >> number at a time. > But you're also tying the caller to the format of the buffer holding > those TIDs, for instance. Why would you, when you can have an > interface that just iterates TIDs and let the caller store them > if/however they want? > > I do believe a pure iterator interface is a better interface. Between the b-tree or not discussion and the refactoring to separate the code, I don't think we'll get this in the next 24 hours. So I guess we'll have ample time to ponder on both issues during the next commit fest.
On 04/06/2018 08:00 PM, Claudio Freire wrote: > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> On 06/04/18 01:59, Claudio Freire wrote: >>>> The iteration interface, however, seems quite specific for the use >>>> case of vacuumlazy, so it's not really a good abstraction. >>> >>> Can you elaborate? It does return the items one block at a time. Is that >>> what you mean by being specific for vacuumlazy? I guess that's a bit >>> special, but if you imagine some other users for this abstraction, it's >>> probably not that unusual. For example, if we started using it in bitmap >>> heap scans, a bitmap heap scan would also want to get the TIDs one block >>> number at a time. >> But you're also tying the caller to the format of the buffer holding >> those TIDs, for instance. Why would you, when you can have an >> interface that just iterates TIDs and let the caller store them >> if/however they want? >> >> I do believe a pure iterator interface is a better interface. > Between the b-tree or not discussion and the refactoring to separate > the code, I don't think we'll get this in the next 24hs. > > So I guess we'll have ample time to poner on both issues during the > next commit fest. > There doesn't seem to have been much pondering done since then, at least publicly. Can we make some progress on this? It's been around for a long time now. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > > > > On 04/06/2018 08:00 PM, Claudio Freire wrote: > > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >>> On 06/04/18 01:59, Claudio Freire wrote: > >>>> The iteration interface, however, seems quite specific for the use > >>>> case of vacuumlazy, so it's not really a good abstraction. > >>> > >>> Can you elaborate? It does return the items one block at a time. Is that > >>> what you mean by being specific for vacuumlazy? I guess that's a bit > >>> special, but if you imagine some other users for this abstraction, it's > >>> probably not that unusual. For example, if we started using it in bitmap > >>> heap scans, a bitmap heap scan would also want to get the TIDs one block > >>> number at a time. > >> But you're also tying the caller to the format of the buffer holding > >> those TIDs, for instance. Why would you, when you can have an > >> interface that just iterates TIDs and let the caller store them > >> if/however they want? > >> > >> I do believe a pure iterator interface is a better interface. > > Between the b-tree or not discussion and the refactoring to separate > > the code, I don't think we'll get this in the next 24hs. > > > > So I guess we'll have ample time to poner on both issues during the > > next commit fest. > > > > > > There doesn't seem to have been much pondering done since then, at least > publicly. Can we make some progress on this? It's been around for a long > time now. Yeah, life has kept me busy and I haven't had much time to make progress here, but I was planning on doing the refactoring as we were discussing soon. Can't give a time frame for that, but "soonish".
On 07/12/2018 12:38 PM, Claudio Freire wrote: > On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> >> >> On 04/06/2018 08:00 PM, Claudio Freire wrote: >>> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>>>> On 06/04/18 01:59, Claudio Freire wrote: >>>>>> The iteration interface, however, seems quite specific for the use >>>>>> case of vacuumlazy, so it's not really a good abstraction. >>>>> Can you elaborate? It does return the items one block at a time. Is that >>>>> what you mean by being specific for vacuumlazy? I guess that's a bit >>>>> special, but if you imagine some other users for this abstraction, it's >>>>> probably not that unusual. For example, if we started using it in bitmap >>>>> heap scans, a bitmap heap scan would also want to get the TIDs one block >>>>> number at a time. >>>> But you're also tying the caller to the format of the buffer holding >>>> those TIDs, for instance. Why would you, when you can have an >>>> interface that just iterates TIDs and let the caller store them >>>> if/however they want? >>>> >>>> I do believe a pure iterator interface is a better interface. >>> Between the b-tree or not discussion and the refactoring to separate >>> the code, I don't think we'll get this in the next 24hs. >>> >>> So I guess we'll have ample time to poner on both issues during the >>> next commit fest. >>> >> >> >> There doesn't seem to have been much pondering done since then, at least >> publicly. Can we make some progress on this? It's been around for a long >> time now. > Yeah, life has kept me busy and I haven't had much time to make > progress here, but I was planning on doing the refactoring as we were > discussing soon. Can't give a time frame for that, but "soonish". I fully understand. I think this needs to go back to "Waiting on Author". cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Jul-12, Andrew Dunstan wrote: > I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > On 2018-Jul-12, Andrew Dunstan wrote: > >> I fully understand. I think this needs to go back to "Waiting on Author". > Why? Heikki's patch applies fine and passes the regression tests. > Well, I understood Claudio was going to do some more work (see upthread). If we're going to go with Heikki's patch then do we need to change the author, or add him as an author? cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 13/07/18 01:39, Andrew Dunstan wrote: > On 07/12/2018 06:34 PM, Alvaro Herrera wrote: >> On 2018-Jul-12, Andrew Dunstan wrote: >> >>> I fully understand. I think this needs to go back to "Waiting on Author". >> Why? Heikki's patch applies fine and passes the regression tests. > > Well, I understood Claudio was going to do some more work (see > upthread). Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. He also wanted to refactor the iterator API, to return one ItemPointer at a time. I don't think that's necessary, the current iterator API is more convenient for the callers, but I don't feel strongly about that. Anything else? > If we're going to go with Heikki's patch then do we need to > change the author, or add him as an author? Let's list both of us. At least in the commit message, doesn't matter much what the commitfest app says. - Heikki
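A minimal sketch of the chunked allocation suggested above: grab large chunks (the 32-64 MB figure mentioned upthread; 64 MB used here) with plain malloc(), bump-allocate tree nodes out of the current half-filled chunk, and free all chunks at once when the TID set is thrown away. In a real patch this would presumably sit behind palloc and a memory context; the arena and function names are made up for the sketch.

/* Sketch: carve tree-node allocations out of large chunks, freed all at
 * once.  Plain malloc/free and made-up names; not the patch's code. */
#include <stdlib.h>
#include <stddef.h>
#include <stdalign.h>
#include <assert.h>

#define CHUNK_SIZE  (64UL * 1024 * 1024)    /* well above any mmap threshold */

typedef struct Chunk
{
    struct Chunk *next;     /* earlier (full) chunks, kept for bulk free */
    size_t        used;     /* bytes handed out from 'data' so far */
    char          data[];   /* the chunk itself */
} Chunk;

typedef struct
{
    Chunk *current;         /* half-filled chunk we are carving from */
} NodeArena;

void *
arena_alloc(NodeArena *arena, size_t size)
{
    Chunk *c = arena->current;

    /* keep every node allocation suitably aligned */
    size = (size + alignof(max_align_t) - 1) & ~(alignof(max_align_t) - 1);
    assert(size <= CHUNK_SIZE);     /* nodes are tiny compared to a chunk */

    if (c == NULL || c->used + size > CHUNK_SIZE)
    {
        Chunk *newc = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

        if (newc == NULL)
            return NULL;
        newc->next = c;
        newc->used = 0;
        arena->current = c = newc;
    }

    {
        void *p = c->data + c->used;

        c->used += size;
        return p;
    }
}

/* No retail frees: release every chunk when the whole TID set is done. */
void
arena_release(NodeArena *arena)
{
    Chunk *c = arena->current;

    while (c != NULL)
    {
        Chunk *next = c->next;

        free(c);
        c = next;
    }
    arena->current = NULL;
}

Because every chunk is far above any malloc() mmap threshold, freeing them at the end of an index-vacuum pass hands the memory back to the OS regardless of how small the individual node allocations were.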
On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > On 13/07/18 01:39, Andrew Dunstan wrote: >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: >>> On 2018-Jul-12, Andrew Dunstan wrote: >>> >>>> I fully understand. I think this needs to go back to "Waiting on >>>> Author". >>> Why? Heikki's patch applies fine and passes the regression tests. >> >> Well, I understood Claudio was going to do some more work (see >> upthread). > > Claudio raised a good point, that doing small pallocs leads to > fragmentation, and in particular, it might mean that we can't give > back the memory to the OS. The default glibc malloc() implementation > has a threshold of 4 or 32 MB or something like that - allocations > larger than the threshold are mmap()'d, and can always be returned to > the OS. I think a simple solution to that is to allocate larger > chunks, something like 32-64 MB at a time, and carve out the > allocations for the nodes from those chunks. That's pretty > straightforward, because we don't need to worry about freeing the > nodes in retail. Keep track of the current half-filled chunk, and > allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? > > He also wanted to refactor the iterator API, to return one ItemPointer > at a time. I don't think that's necessary, the current iterator API is > more convenient for the callers, but I don't feel strongly about that. > > Anything else? > >> If we're going to go with Heikki's patch then do we need to >> change the author, or add him as an author? > > Let's list both of us. At least in the commit message, doesn't matter > much what the commitfest app says. > I added you as an author in the CF App cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > On 13/07/18 01:39, Andrew Dunstan wrote: > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > >>> On 2018-Jul-12, Andrew Dunstan wrote: > >>> > >>>> I fully understand. I think this needs to go back to "Waiting on > >>>> Author". > >>> Why? Heikki's patch applies fine and passes the regression tests. > >> > >> Well, I understood Claudio was going to do some more work (see > >> upthread). > > > > Claudio raised a good point, that doing small pallocs leads to > > fragmentation, and in particular, it might mean that we can't give > > back the memory to the OS. The default glibc malloc() implementation > > has a threshold of 4 or 32 MB or something like that - allocations > > larger than the threshold are mmap()'d, and can always be returned to > > the OS. I think a simple solution to that is to allocate larger > > chunks, something like 32-64 MB at a time, and carve out the > > allocations for the nodes from those chunks. That's pretty > > straightforward, because we don't need to worry about freeing the > > nodes in retail. Keep track of the current half-filled chunk, and > > allocate a new one when it fills up. > > > Google seems to suggest the default threshold is much lower, like 128K. > Still, making larger allocations seems sensible. Are you going to work > on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: > > > > > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > > On 13/07/18 01:39, Andrew Dunstan wrote: > > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > >>> On 2018-Jul-12, Andrew Dunstan wrote: > > >>> > > >>>> I fully understand. I think this needs to go back to "Waiting on > > >>>> Author". > > >>> Why? Heikki's patch applies fine and passes the regression tests. > > >> > > >> Well, I understood Claudio was going to do some more work (see > > >> upthread). > > > > > > Claudio raised a good point, that doing small pallocs leads to > > > fragmentation, and in particular, it might mean that we can't give > > > back the memory to the OS. The default glibc malloc() implementation > > > has a threshold of 4 or 32 MB or something like that - allocations > > > larger than the threshold are mmap()'d, and can always be returned to > > > the OS. I think a simple solution to that is to allocate larger > > > chunks, something like 32-64 MB at a time, and carve out the > > > allocations for the nodes from those chunks. That's pretty > > > straightforward, because we don't need to worry about freeing the > > > nodes in retail. Keep track of the current half-filled chunk, and > > > allocate a new one when it fills up. > > > > > > Google seems to suggest the default threshold is much lower, like 128K. > > Still, making larger allocations seems sensible. Are you going to work > > on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > the dynamic threshold is set to the size of the freed block. > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else.
On 16/07/18 18:35, Claudio Freire wrote: > On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote: >> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan >> <andrew.dunstan@2ndquadrant.com> wrote: >>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: >>>> Claudio raised a good point, that doing small pallocs leads to >>>> fragmentation, and in particular, it might mean that we can't give >>>> back the memory to the OS. The default glibc malloc() implementation >>>> has a threshold of 4 or 32 MB or something like that - allocations >>>> larger than the threshold are mmap()'d, and can always be returned to >>>> the OS. I think a simple solution to that is to allocate larger >>>> chunks, something like 32-64 MB at a time, and carve out the >>>> allocations for the nodes from those chunks. That's pretty >>>> straightforward, because we don't need to worry about freeing the >>>> nodes in retail. Keep track of the current half-filled chunk, and >>>> allocate a new one when it fills up. >>> >>> Google seems to suggest the default threshold is much lower, like 128K. >>> Still, making larger allocations seems sensible. Are you going to work >>> on that? >> >> Below a few MB the threshold is dynamic, and if a block bigger than >> 128K but smaller than the higher threshold (32-64MB IIRC) is freed, >> the dynamic threshold is set to the size of the freed block. >> >> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] >> >> So I'd suggest allocating blocks bigger than M_MMAP_MAX. >> >> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html > > Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the > former is something else. Yeah, we basically want to be well above whatever the threshold is. I don't think we should try to check for any specific constant, just make it large enough. Different libc implementations might have different policies, too. There's little harm in overshooting, and making e.g. 64 MB allocations when 1 MB would've been enough to trigger the mmap() behavior. It's going to be more granular than the current situation, anyway, where we do a single massive allocation. (A code comment to briefly mention the thresholds on common platforms would be good, though). - Heikki
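For reference, a small glibc-specific illustration of the threshold behaviour being discussed. PostgreSQL itself would not normally call mallopt(); the sketch only shows that an allocation well above DEFAULT_MMAP_THRESHOLD_MAX is served by mmap() and handed back to the OS on free(), which is the property the large-chunk approach relies on.

/* glibc/Linux-specific illustration of the malloc() mmap threshold. */
#include <malloc.h>     /* mallopt(), M_MMAP_THRESHOLD (glibc) */
#include <stdlib.h>
#include <stdio.h>

int
main(void)
{
#ifdef M_MMAP_THRESHOLD
    /* Pin the otherwise dynamic threshold so it can no longer creep up. */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
#endif

    /*
     * 64 MB is above DEFAULT_MMAP_THRESHOLD_MAX (32 MB on 64-bit glibc),
     * so this block is mmap()'d even without the mallopt() call above.
     */
    char *big = malloc(64UL * 1024 * 1024);

    if (big == NULL)
        return 1;
    big[0] = 1;     /* touch it so the mapping is really populated */
    free(big);      /* munmap()'d here: the memory goes back to the OS */

    printf("allocated and released a 64 MB block\n");
    return 0;
}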
On 07/16/2018 10:34 AM, Claudio Freire wrote: > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> >> >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: >>> On 13/07/18 01:39, Andrew Dunstan wrote: >>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: >>>>> On 2018-Jul-12, Andrew Dunstan wrote: >>>>> >>>>>> I fully understand. I think this needs to go back to "Waiting on >>>>>> Author". >>>>> Why? Heikki's patch applies fine and passes the regression tests. >>>> Well, I understood Claudio was going to do some more work (see >>>> upthread). >>> Claudio raised a good point, that doing small pallocs leads to >>> fragmentation, and in particular, it might mean that we can't give >>> back the memory to the OS. The default glibc malloc() implementation >>> has a threshold of 4 or 32 MB or something like that - allocations >>> larger than the threshold are mmap()'d, and can always be returned to >>> the OS. I think a simple solution to that is to allocate larger >>> chunks, something like 32-64 MB at a time, and carve out the >>> allocations for the nodes from those chunks. That's pretty >>> straightforward, because we don't need to worry about freeing the >>> nodes in retail. Keep track of the current half-filled chunk, and >>> allocate a new one when it fills up. >> >> Google seems to suggest the default threshold is much lower, like 128K. >> Still, making larger allocations seems sensible. Are you going to work >> on that? > Below a few MB the threshold is dynamic, and if a block bigger than > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > the dynamic threshold is set to the size of the freed block. > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html That page says: M_MMAP_MAX This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance. The default value is 65,536, a value which has no special significance and which serves only as a safeguard. Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests. I'm confused about the relevance. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > > > > On 07/16/2018 10:34 AM, Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > > <andrew.dunstan@2ndquadrant.com> wrote: > >> > >> > >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > >>> On 13/07/18 01:39, Andrew Dunstan wrote: > >>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > >>>>> On 2018-Jul-12, Andrew Dunstan wrote: > >>>>> > >>>>>> I fully understand. I think this needs to go back to "Waiting on > >>>>>> Author". > >>>>> Why? Heikki's patch applies fine and passes the regression tests. > >>>> Well, I understood Claudio was going to do some more work (see > >>>> upthread). > >>> Claudio raised a good point, that doing small pallocs leads to > >>> fragmentation, and in particular, it might mean that we can't give > >>> back the memory to the OS. The default glibc malloc() implementation > >>> has a threshold of 4 or 32 MB or something like that - allocations > >>> larger than the threshold are mmap()'d, and can always be returned to > >>> the OS. I think a simple solution to that is to allocate larger > >>> chunks, something like 32-64 MB at a time, and carve out the > >>> allocations for the nodes from those chunks. That's pretty > >>> straightforward, because we don't need to worry about freeing the > >>> nodes in retail. Keep track of the current half-filled chunk, and > >>> allocate a new one when it fills up. > >> > >> Google seems to suggest the default threshold is much lower, like 128K. > >> Still, making larger allocations seems sensible. Are you going to work > >> on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > > the dynamic threshold is set to the size of the freed block. > > > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html > > > That page says: > > M_MMAP_MAX > This parameter specifies the maximum number of allocation > requests that may be simultaneously serviced using mmap(2). > This parameter exists because some systems have a limited > number of internal tables for use by mmap(2), and using more > than a few of them may degrade performance. > > The default value is 65,536, a value which has no special > significance and which serves only as a safeguard. Setting > this parameter to 0 disables the use of mmap(2) for servicing > large allocation requests. > > > I'm confused about the relevance. It isn't relevant. See my next message, it should have read DEFAULT_MMAP_THRESHOLD_MAX.
On 07/16/2018 11:35 AM, Claudio Freire wrote: > On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote: >> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan >> <andrew.dunstan@2ndquadrant.com> wrote: >>> >>> >>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: >>>> On 13/07/18 01:39, Andrew Dunstan wrote: >>>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: >>>>>> On 2018-Jul-12, Andrew Dunstan wrote: >>>>>> >>>>>>> I fully understand. I think this needs to go back to "Waiting on >>>>>>> Author". >>>>>> Why? Heikki's patch applies fine and passes the regression tests. >>>>> Well, I understood Claudio was going to do some more work (see >>>>> upthread). >>>> Claudio raised a good point, that doing small pallocs leads to >>>> fragmentation, and in particular, it might mean that we can't give >>>> back the memory to the OS. The default glibc malloc() implementation >>>> has a threshold of 4 or 32 MB or something like that - allocations >>>> larger than the threshold are mmap()'d, and can always be returned to >>>> the OS. I think a simple solution to that is to allocate larger >>>> chunks, something like 32-64 MB at a time, and carve out the >>>> allocations for the nodes from those chunks. That's pretty >>>> straightforward, because we don't need to worry about freeing the >>>> nodes in retail. Keep track of the current half-filled chunk, and >>>> allocate a new one when it fills up. >>> >>> Google seems to suggest the default threshold is much lower, like 128K. >>> Still, making larger allocations seems sensible. Are you going to work >>> on that? >> Below a few MB the threshold is dynamic, and if a block bigger than >> 128K but smaller than the higher threshold (32-64MB IIRC) is freed, >> the dynamic threshold is set to the size of the freed block. >> >> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] >> >> So I'd suggest allocating blocks bigger than M_MMAP_MAX. >> >> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html > Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the > former is something else. Ah, ok. Thanks. ignore the email I just sent about that. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> True all that. My point is that the multi-segmented array isn't all that >> simple and proven, compared to an also straightforward B-tree. It's pretty >> similar to a B-tree, actually, except that it has exactly two levels, and >> the node (= segment) sizes grow exponentially. I'd rather go with a true >> B-tree, than something homegrown that resembles a B-tree, but not quite. > > I disagree. Yeah, me too. I think a segmented array is a lot simpler than a home-grown btree. I wrote a home-grown btree that ended up becoming src/backend/utils/mmgr/freepage.c and it took me a long time to get rid of all the bugs. Heikki is almost certainly better at coding up a bug-free btree than I am, but a segmented array is a dead simple data structure, or should be if done properly, and a btree is not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
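For comparison, here is a stripped-down sketch of the append-only segmented array being discussed: segments whose capacity grows geometrically, filled strictly in order, with lookup done by a binary search over segment start keys followed by one within the chosen segment. It stores plain uint64_t keys and uses malloc() for brevity; the actual patch stores ItemPointers, sizes segments differently, and allocates through palloc, so treat every name and constant here as an assumption.

/* Sketch: append-only segmented array with geometrically growing
 * segments.  Not the patch's code; simplified for illustration. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SEGMENTS   64
#define FIRST_SEG_SIZE (1024 * 1024)   /* entries in the first segment */

typedef struct
{
    uint64_t *items;     /* sorted, because callers append in order */
    size_t    nitems;
    size_t    capacity;
} Segment;

typedef struct
{
    Segment segs[MAX_SEGMENTS];
    int     nsegs;
} SegArray;

/* Append one key; keys must arrive in ascending order. */
bool
segarray_append(SegArray *a, uint64_t key)
{
    Segment *s = (a->nsegs > 0) ? &a->segs[a->nsegs - 1] : NULL;

    if (s == NULL || s->nitems == s->capacity)
    {
        size_t cap = (s == NULL) ? FIRST_SEG_SIZE : s->capacity * 2;

        if (a->nsegs == MAX_SEGMENTS)
            return false;
        s = &a->segs[a->nsegs];
        s->items = malloc(cap * sizeof(uint64_t));
        if (s->items == NULL)
            return false;
        s->nitems = 0;
        s->capacity = cap;
        a->nsegs++;
    }
    s->items[s->nitems++] = key;
    return true;
}

/* Membership test: pick the segment by its first key, then search in it. */
bool
segarray_lookup(const SegArray *a, uint64_t key)
{
    int lo = 0, hi = a->nsegs - 1, seg = -1;

    while (lo <= hi)                /* last segment whose first key <= key */
    {
        int mid = (lo + hi) / 2;

        if (a->segs[mid].items[0] <= key)
        {
            seg = mid;
            lo = mid + 1;
        }
        else
            hi = mid - 1;
    }
    if (seg < 0)
        return false;

    {
        const Segment *s = &a->segs[seg];
        size_t l = 0, h = s->nitems;

        while (l < h)               /* lower bound within the segment */
        {
            size_t m = l + (h - l) / 2;

            if (s->items[m] < key)
                l = m + 1;
            else
                h = m;
        }
        return l < s->nitems && s->items[l] == key;
    }
}

A caller would zero-initialize a SegArray, append one encoded key per dead TID in heap order (for example ((uint64_t) blkno << 16) | offnum), and probe it from the index-vacuum callback with segarray_lookup().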
On Mon, Jul 16, 2018 at 02:33:17PM -0400, Andrew Dunstan wrote: > Ah, ok. Thanks. ignore the email I just sent about that. So... This thread has basically died of inactivity, while there have been a couple of interesting things discussed, like the version from Heikki here: https://www.postgresql.org/message-id/cd8f7b62-17e1-4307-9f81-427922e5a1f6@iki.fi I am marking the patches as returned with feedback for now. -- Michael