Thread: Boundary value check in lazy_tid_reaped()
Hi all,

In the current lazy vacuum implementation, some index AMs such as btree call lazy_tid_reaped() for each index tuple during ambulkdelete to check whether the index tuple points to a collected garbage tuple. That function simply calls bsearch(3), but we could know the result without bsearch(3) when the index tuple points to a heap tuple that is out of the range of the collected garbage tuples. I checked whether bsearch(3) itself does such a boundary check: looking at the bsearch(3) implementation in FreeBSD libc[1], it does not, and the same is true for glibc.

So my proposal is to add a boundary value check in lazy_tid_reaped() before calling bsearch(3). This will help when index vacuum happens multiple times, or when garbage tuples are concentrated in a narrow range.

I've done three performance tests. maintenance_work_mem is set to 64MB, with which we can collect at most about 11 million tuples, and the table size is 10GB. The table has one btree index. Here is the average index vacuum (ambulkdelete) execution time over three trials:

* Test case 1

I made all pages dirty with 15 million garbage tuples to invoke index vacuum twice.

HEAD: 164.7 s
Patched: 138.81 s

* Test case 2

I made all pages dirty with 10 million garbage tuples to invoke index vacuum only once, checking for performance degradation.

HEAD: 127.8 s
Patched: 128.25 s

* Test case 3

I made a certain range of the table dirty with 1 million garbage tuples.

HEAD: 31.07 s
Patched: 9.41 s

I thought we could have a generic function wrapping bsearch(3) that does the boundary value check and then calls bsearch(3), so that we can use it in other similar places as well. But the attached patch doesn't do that yet, as I'd like to hear opinions on the proposal first.

Feedback is very welcome.
Regards,

[1] https://svnweb.freebsd.org/base/head/lib/libc/stdlib/bsearch.c?revision=326025&view=markup

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> So my proposal is to add boundary value check in lazy_tid_reaped()
> before executing bsearch(3). This will help when index vacuum happens
> multiple times or when garbage tuples are concentrated to a narrow
> range.

Makes sense if it's often out of range.

> I thought that we can have a generic function wrapping bsearch(3) that
> does boundary value checks and then does bsearch(3) so that we can use
> it in other similar places as well. But the attached patch doesn't do
> that as I'd like to hear opinions on the proposal first.

I wonder if you would also see a speed-up with a bsearch() replacement that is inlineable, so it can inline the comparator (instead of calling it through a function pointer). I wonder if something more like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than the branchy comparator.
On Tue, Sep 1, 2020 at 7:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > So my proposal is to add boundary value check in lazy_tid_reaped()
> > before executing bsearch(3). This will help when index vacuum happens
> > multiple times or when garbage tuples are concentrated to a narrow
> > range.
>
> Makes sense if it's often out of range.

... though I'm not sure why you need to add extra members to do it?

> > I thought that we can have a generic function wrapping bsearch(3) that
> > does boundary value checks and then does bsearch(3) so that we can use
> > it in other similar places as well. But the attached patch doesn't do
> > that as I'd like to hear opinions on the proposal first.
>
> I wonder if you would also see a speed-up with a bsearch() replacement
> that is inlineable, so it can inline the comparator (instead of
> calling it through a function pointer). I wonder if something more
> like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
> the branchy comparator.

Erm, of course that expression won't work... it should be << 16, but even then it would only work with a bsearch that uses int64_t comparators, so I take that part back.
On Mon, Aug 31, 2020 at 12:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > So my proposal is to add boundary value check in lazy_tid_reaped()
> > before executing bsearch(3). This will help when index vacuum happens
> > multiple times or when garbage tuples are concentrated to a narrow
> > range.
>
> Makes sense if it's often out of range.

I also think this is a good idea. But I also wonder if it goes far enough. Only one or two dead TIDs in inconvenient heap pages can completely defeat the optimization.

A related problem with the TID list is that it is very space inefficient. It would be nice to fix that problem at some point. We could make multiple index scans by VACUUM operations much rarer if we tried. But how can we do all of this together?

I wonder if Roaring bitmaps would work well for this:

https://arxiv.org/abs/1709.07821

This data structure looks like it might work well in lazy_tid_reaped() (for the TID array), because it offers effective bitmap compression without compromising on binary search speed. Of course, you'd have to encode TIDs as bitmaps to make use of the data structure, much like tidbitmap.c. I imagine that the Roaring bitmap indirection would be very CPU cache efficient in cases that are similar to the cases Sawada-san is worried about, but a bit more complicated.

VACUUM would be able to make the summarizing information for the TID bitmap resident in CPU cache. Only this well-cached summarizing information (the top-level bitmap range indirection) will need to be accessed by most individual calls to a Roaring-bitmap-aware lazy_tid_reaped() that return false (i.e. calls that say "don't kill this TID, it's alive"). These performance characteristics seem likely when a certain amount of clustering of dead tuples takes place in the heap.
I bet having completely random positions for dead TIDs almost never happens -- *some* clustering is practically certain to take place, even without natural locality in the data (which is itself very common). Think about how opportunistic pruning accumulates many LP_DEAD items in heap pages.

There is a reference Roaring bitmap implementation in C, so it's probably not that hard to experimentally determine how well it would work:

https://github.com/RoaringBitmap/CRoaring

Lots of open source projects already use it, so there are no problems with patents. Seems worth investigating. (I also wonder if we could use it for clog.)

--
Peter Geoghegan
On Mon, Aug 31, 2020 at 1:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder if Roaring bitmaps would work well for this:
>
> https://arxiv.org/abs/1709.07821

Alternatively, perhaps we could use a negative cache of heap blocks that have no tuples to kill at all. Maybe just a simple array whose elements are BlockNumber pairs. Each pair represents a range of blocks known to contain heap pages that *don't* have any TIDs for VACUUM to kill. The array could be size limited to 8KB, allowing it to be accessed in L1 cache throughout vacuuming. When the representation cannot fit in 8KB, maybe we disable the negative cache for the rest of the VACUUM operation.

This is a bit like Masahiko's min_blkno + max_blkno idea, except: 1) it's a negative cache, and 2) there are perhaps as many as 1024 min/max pairs -- not just one.

It's pretty clear that more than 90% of all calls to lazy_tid_reaped() return false unless vacuum is running behind (false means "don't kill this TID, it's alive"). But it could be much more than 90%. This may be because autovacuum is configured to run more aggressively than the default settings allow. But it may also happen when LP_DEAD killing in indexes works well enough to do most of the cleanup needed in practice. I bet pgbench finds that ~99% of all calls to lazy_tid_reaped() that take place during index vacuuming find that the TID doesn't need to be killed. So negative caching could really help.

--
Peter Geoghegan
On Tue, 1 Sep 2020 at 04:44, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 7:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > So my proposal is to add boundary value check in lazy_tid_reaped()
> > > before executing bsearch(3). This will help when index vacuum happens
> > > multiple times or when garbage tuples are concentrated to a narrow
> > > range.
> >
> > Makes sense if it's often out of range.
>
> ... though I'm not sure why you need to add extra members to do it?

Indeed. We can use the first and last elements of the itemptrs array.

> > > I thought that we can have a generic function wrapping bsearch(3) that
> > > does boundary value checks and then does bsearch(3) so that we can use
> > > it in other similar places as well. But the attached patch doesn't do
> > > that as I'd like to hear opinions on the proposal first.
> >
> > I wonder if you would also see a speed-up with a bsearch() replacement
> > that is inlineable, so it can inline the comparator (instead of
> > calling it through a function pointer). I wonder if something more
> > like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
> > the branchy comparator.
>
> Erm, off course that expression won't work... should be << 16, but
> even then it would only work with a bsearch that uses int64_t
> comparators, so I take that part back.

Yeah, it seems worth benchmarking the speed-up with inlining. I'll do some performance tests with/without inlining on top of checking boundary values.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 8, 2020 at 1:23 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> > > I wonder if you would also see a speed-up with a bsearch() replacement
> > > that is inlineable, so it can inline the comparator (instead of
> > > calling it through a function pointer). I wonder if something more
> > > like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
> > > the branchy comparator.
> >
> > Erm, off course that expression won't work... should be << 16, but
> > even then it would only work with a bsearch that uses int64_t
> > comparators, so I take that part back.
>
> Yeah, it seems to worth benchmarking the speed-up with an inlining.
> I'll do some performance tests with/without inlining on top of
> checking boundary values.

It sounds like Thomas was talking about something like itemptr_encode() + itemptr_decode(). In case you didn't know, we actually do something like this for the TID tuplesort used for CREATE INDEX CONCURRENTLY.

--
Peter Geoghegan
On Tue, 1 Sep 2020 at 08:01, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 31, 2020 at 1:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I wonder if Roaring bitmaps would work well for this:
> >
> > https://arxiv.org/abs/1709.07821
>
> Alternatively, perhaps we could use a negative cache of heap blocks
> that have no tuples to kill at all. Maybe just a simple array whose
> elements are BlockNumber pairs. Each pair represents a range of blocks
> known to contain heap pages that *don't* have any TIDs for VACUUM to
> kill. The array could be size limited to 8KB, allowing it to be
> accessed in L1 cache throughout vacuuming. When the representation
> cannot fit in 8KB, maybe we disable the negative cache for the rest of
> the VACUUM operation.
>
> This is a bit like Masahiko's min_blkno + max_blkno idea, except: 1).
> It's a negative cache, and 2.) There are perhaps as many as 1024
> min/max pairs -- not just 1.

Interesting idea. Maybe we would need one more bsearch() on the min/max pairs array when a TID should be killed?

> It's pretty clear that more than 90% of all calls to lazy_tid_reaped()
> return false unless vacuum is running behind (false means "don't kill
> this TID, it's alive"). But it could be much more than 90%. This may
> be because autovacuum is configured to run more aggressively than the
> default settings allow. But it may also happen when LP_DEAD killing in
> indexes works well enough to do most of the cleanup needed in
> practice. I bet pgbench finds that ~99% of all calls to
> lazy_tid_reaped() that take place during index vacuuming find that the
> TID doesn't need to be killed. So negative caching could really help.

I agree that lazy_tid_reaped() returns false for most checks when autovacuum is not running behind. But I'm a bit concerned that it would instead penalize the case where vacuum needs to kill many TIDs and uses the min/max filter, because it then needs to call bsearch() twice.
I think that is precisely the case where autovacuum is running behind and the user wants vacuum to complete as soon as possible. Since I expect the filtering check against the min/max pairs array to take very little time, it might not be a problem, but I think it's worth benchmarking the overhead in the worst case. Or I guess we could use a bloom filter for this purpose instead.

Also, I'm concerned that 1024 min/max pairs might not be enough in practice, especially for very large tables.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 1 Sep 2020 at 05:56, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 31, 2020 at 12:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > So my proposal is to add boundary value check in lazy_tid_reaped()
> > > before executing bsearch(3). This will help when index vacuum happens
> > > multiple times or when garbage tuples are concentrated to a narrow
> > > range.
> >
> > Makes sense if it's often out of range.
>
> I also think this is a good idea. But I also wonder if it goes far
> enough. Only one or two dead TIDs in inconvenient heap pages can
> completely defeat the optimization.
>
> A related problem with the TID list is that it is very space
> inefficient. It would be nice to fix that problem at some point. We
> could make multiple index scans by VACUUM operations much rarer if we
> tried. But how can we do all of this together?
>
> I wonder if Roaring bitmaps would work well for this:
>
> https://arxiv.org/abs/1709.07821
>
> This data structure looks like it might work well in lazy_tid_reaped()
> (for the TID array), because it offers effective bitmap compression
> without compromising on binary search speed. Of course, you'd have to
> encode TIDs as bitmaps to make use of the data structure, much like
> tidbitmap.c. I imagine that the Roaring bitmap indirection would be
> very CPU cache efficient in cases that are similar to the cases
> Sawada-san is worried about, but a bit more complicated.
>
> VACUUM would be able to make the summarizing information for the TID
> bitmap resident in CPU cache. Only this well-cached summarizing
> information (the top-level bitmap range indirection) will need to be
> accessed by most individual calls to a Roaring-bitmap-aware
> lazy_tid_reaped() that return false (i.e. calls that say "don't kill
> this TID, it's alive").
> These performance characteristics seem likely
> when a certain amount of clustering of dead tuples takes place in the
> heap. I bet having completely random positions for dead TIDs almost
> never happens -- *some* clustering is practically certain to take
> place, even without natural locality in the data (which is itself very
> common). Think about how opportunistic pruning accumulates many
> LP_DEAD items in heap pages.
>
> There is a reference Roaring bitmap implementation in C, so it's
> probably not that hard to experimentally determine how well it would
> work:
>
> https://github.com/RoaringBitmap/CRoaring
>
> Lots of open source projects already use it, so there are no problems
> with patents. Seems worth investigating. (I also wonder if we could
> use it for clog.)

Very interesting. Before getting into the CRoaring bitmap, I've done some experiments with three different methods of recording dead tuple TIDs: array, array-minmax, and integer set.

'array' is the current method lazy vacuum uses. It just adds dead tuple TIDs to a simple array of ItemPointerData. 'array-minmax' is the same as 'array' except that it checks the min and max boundaries when deleting index dead tuples (i.e., in lazy_tid_reaped()). 'intset' uses the integer set (src/backend/lib/integerset.c), an in-memory data structure for holding 64-bit integers, to record dead tuple TIDs.

In the experiments, I created a 4GB table and made 20% of the total tuples dirty in all test cases, while changing the distribution of where dead tuples exist within the table. The table has only one index, so I didn't use parallel vacuum. I set maintenance_work_mem high enough that the index scan runs only once.

Here are the results, showing the total execution time in seconds (heap scan time/index vacuum time/table vacuum time/memory usage in MB). The numbers are rounded down to the nearest decimal.

1. Updated the table evenly (every block has at least one dead tuple).

array        : 22 (8/12/2/114)
array-minmax : 20 (8/11/2/114)
intset       : 17 (7/8/1/19)

2. Updated the middle of the table.

array        : 25 (6/19/0/114)
array-minmax : 17 (6/11/0/114)
intset       : 17 (6/11/0/7)

3. Updated the tail of the table.

array        : 25 (6/19/0/114)
array-minmax : 18 (6/11/0/114)
intset       : 18 (6/11/0/5)

4. Updated both the beginning and the tail of the table.

array        : 25 (6/19/0/114)
array-minmax : 23 (6/17/0/114)
intset       : 21 (6/14/0/6)

The memory usage is calculated as (# of dead tuples * sizeof(ItemPointerData)) in both the 'array' and 'array-minmax' cases, although the actual amount of palloc'd memory is different, and by intset_memory_usage() in the 'intset' case.

Using the integer set is very memory efficient (5MB vs. 114MB in the base case) and there is no 1GB limitation. Looking at the execution time, I had expected recording TIDs into the integer set to be slower than using the simple array, but in this experiment there is no big difference among the methods. Perhaps the result would vary if vacuum needed to record many more dead tuple TIDs. From the results, I can see a good improvement in the integer set case, and it is probably a good start, but if we want to use it for lazy vacuum we will also need to improve it so that it can be placed in DSA for parallel vacuum.

I've attached the patch I used for the experiment, which adds an xx_vacuum GUC parameter that controls the method of recording TIDs. Please note that it doesn't support parallel vacuum.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2020-10-30 02:43, Masahiko Sawada wrote:
> Using the integer set is very memory efficient (5MB vs. 114MB in the
> base case) and there is no 1GB limitation. Looking at the execution
> time, I had expected that using the integer set is slower on recording
> TIDs than using the simple array but in this experiment, there is no
> big difference among methods. Perhaps the result will vary if vacuum
> needs to record much more dead tuple TIDs. From the results, I can see
> a good improvement in the integer set case and probably a good start
> but if we want to use it also for lazy vacuum, we will need to improve
> it so that it can be used on DSA for the parallel vacuum.
>
> I've attached the patch I used for the experiment that adds xx_vacuum
> GUC parameter that controls the method of recording TIDs. Please note
> that it doesn't support parallel vacuum.

How do you want to proceed here? The approach of writing a wrapper for bsearch with a bound check sounded like a good start. All the other ideas discussed here seem to be larger projects and would probably be out of scope for this commit fest.
On Wed, Jan 20, 2021 at 3:50 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 2020-10-30 02:43, Masahiko Sawada wrote:
> > Using the integer set is very memory efficient (5MB vs. 114MB in the
> > base case) and there is no 1GB limitation. Looking at the execution
> > time, I had expected that using the integer set is slower on recording
> > TIDs than using the simple array but in this experiment, there is no
> > big difference among methods. Perhaps the result will vary if vacuum
> > needs to record much more dead tuple TIDs. From the results, I can see
> > a good improvement in the integer set case and probably a good start
> > but if we want to use it also for lazy vacuum, we will need to improve
> > it so that it can be used on DSA for the parallel vacuum.
> >
> > I've attached the patch I used for the experiment that adds xx_vacuum
> > GUC parameter that controls the method of recording TIDs. Please note
> > that it doesn't support parallel vacuum.
>
> How do you want to proceed here? The approach of writing a wrapper for
> bsearch with bound check sounded like a good start. All the other ideas
> discussed here seem larger projects and would probably be out of scope
> of this commit fest.

Agreed. bsearch with a bound check showed a reasonable improvement in my evaluation in terms of performance. Regarding memory efficiency, we can experiment with other methods later.

I've attached a patch that adds a bound check for encoded itempointers before the bsearch() in lazy_tid_reaped() and inlines the function.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/