Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Peter Geoghegan
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAH2-WzmBp8YGpp2pDudCcCxw-7rhL-KhR7_DBfPG+vz+AK+2fA@mail.gmail.com
In response to: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:
>
> 1. Don't allocate more than 1GB. There was a discussion to eliminate
> this limitation by using MemoryContextAllocHuge() but there were
> concerns about point 2[1].

I think that the main problem with the 1GB limitation is that it is surprising -- it can cause disruption when we first exceed the magical limit of ~174 million TIDs (roughly the number of 6-byte ItemPointers that fit in 1GB). Hitting the limit can cause us to dirty index pages a second time, when we might have been able to do it just once with sufficient memory for TIDs. OTOH, there are actually cases where having less memory for TIDs makes performance *better*, because of locality effects. This perverse behavior with memory sizing isn't a rare case that we can safely ignore -- unfortunately it's fairly common.

My point is that we should be careful to choose the correct goal. Obviously memory use matters. But it might be more helpful to think of memory use as just a proxy for what truly matters, not as a goal in itself. It's hard to know what this means (what is the "real goal"?), and hard to measure it even if you know for sure. It could still be useful to think of it like this, though.

> A run container is selected in this test case, using 4 bytes for each block.
>
>            Execution Time    Memory Usage
> array          8,883.03      600,008,248
> intset         7,358.23      100,671,488
> tbm              758.81      100,671,544
> rtbm             764.33       29,384,816
>
> Overall, 'rtbm' has much better lookup performance and good memory
> usage, especially if there are relatively many dead tuples. However, in
> some cases, 'intset' and 'array' have better memory usage.

This seems very promising.

I wonder how much you have thought about the index AM side. It makes sense to initially evaluate these techniques by separating the data structure from how VACUUM uses it -- I think that was a good idea. But at the same time, there may be certain important questions that cannot be answered this way -- questions about how everything "fits together" in a real VACUUM might matter a lot. You've probably thought about this at least a little already. I'm curious to hear how you think it "fits together" with the work that you've done so far.

The loop inside btvacuumpage() calls the callback once per iteration -- in practice this is always a call to lazy_tid_reaped(). That's where we do the binary searches. These binary searches are usually where we see a huge number of cycles spent when we look at profiles, including the profile that produced your flame graph. But I worry that that might be a bit misleading -- the way that profilers attribute costs is very complicated, and can never be fully trusted. While it is true that lazy_tid_reaped() often accesses main memory, which of course adds a huge amount of latency and makes it a huge bottleneck, the "big picture" is still relevant. I think that the compiler currently has to make very conservative assumptions when generating the machine code for the loop inside btvacuumpage(), which calls through an opaque function pointer at least once per iteration -- anything can alias, so the compiler must be conservative. The data dependencies are hard for both the compiler and the CPU to analyze.

The cost of using a function pointer compared to a direct function call is usually quite low, but there are important exceptions -- cases where it prevents other useful optimizations. Maybe this is one of them. I wonder how much it would help to break that loop up into two loops: make the callback into a batch operation that generates state describing what to do with each and every index tuple on the leaf page. The first loop would build a list of TIDs, then you'd call into vacuumlazy.c and get it to process the TIDs, and finally the second loop would physically delete the TIDs that need to be deleted. That way there would be only one callback call per leaf page per btbulkdelete(), reducing the number of calls by at least 100x, and maybe by more than 1000x. This approach would make btbulkdelete() similar to _bt_simpledel_pass() + _bt_delitems_delete_check().
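To make that concrete, the loop currently has roughly this shape (simplified from btvacuumpage(); posting list tuples and various other details are elided, and the surrounding locals like page/minoff/maxoff are assumed):

    for (offnum = minoff;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        IndexTuple  itup;

        itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

        /* Opaque call through the IndexBulkDeleteCallback pointer --
         * always lazy_tid_reaped() in practice, once per index tuple */
        if (callback(&itup->t_tid, callback_state))
            deletable[ndeletable++] = offnum;
    }

And here is a rough sketch of the two-loop shape that I have in mind. To be clear, lazy_tids_reaped_batch() and these arrays are names that I'm inventing purely for illustration -- this isn't a worked-out design:

    ItemPointerData htids[MaxIndexTuplesPerPage];
    OffsetNumber    offnums[MaxIndexTuplesPerPage];
    OffsetNumber    deletable[MaxIndexTuplesPerPage];
    bool            deadflags[MaxIndexTuplesPerPage];
    int             nhtids = 0;
    int             ndeletable = 0;

    /* First loop: gather every heap TID on the leaf page */
    for (offnum = minoff;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        IndexTuple  itup;

        itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

        offnums[nhtids] = offnum;
        htids[nhtids++] = itup->t_tid;
    }

    /* One call into vacuumlazy.c per leaf page (not one per TID),
     * which sets deadflags[i] for each TID that turns out to be dead */
    lazy_tids_reaped_batch(htids, nhtids, deadflags, callback_state);

    /* Second loop: physically delete whatever the batch call marked */
    for (int i = 0; i < nhtids; i++)
    {
        if (deadflags[i])
            deletable[ndeletable++] = offnums[i];
    }

The batch call is where all of the interesting scheduling could happen -- sorting the page's TIDs, prefetching, whatever turns out to help.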
This is not really independent of your ideas -- I imagine that the batch approach would work far better when combined with a more compact data structure, which is naturally more capable of batch processing than a simple array of TIDs. Maybe this will help the compiler and the CPU to fully understand the *natural* data dependencies, so that they can be as effective as possible in making the code run fast. It's possible that a modern CPU will be able to *hide* the latency more intelligently than what we have today. The latency is such a big problem that we may be able to justify "wasting" other CPU resources, just because doing so sometimes helps with hiding the latency. For example, it might actually be okay to sort all of the TIDs on the page to make the bulk processing work -- though you might still do a precheck similar to the one inside lazy_tid_reaped() that you added in commit bbaf315309e.

Of course it's very easy to be wrong about stuff like this. But it might not be that hard to prototype: you can literally copy and paste code from _bt_delitems_delete_check() to do this. It does the same basic thing already.
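For reference, the precheck that I'm talking about looks roughly like this today (lightly paraphrased from lazy_tid_reaped() in vacuumlazy.c -- treat it as a sketch, not the verbatim code):

    static bool
    lazy_tid_reaped(ItemPointer itemptr, void *state)
    {
        LVDeadTuples *dead_tuples = (LVDeadTuples *) state;
        int64       litem,
                    ritem,
                    item;

        litem = itemptr_encode(&dead_tuples->itemptrs[0]);
        ritem = itemptr_encode(&dead_tuples->itemptrs[dead_tuples->num_tuples - 1]);
        item = itemptr_encode(itemptr);

        /* Cheap range precheck (commit bbaf315309e) skips the bsearch()
         * entirely when the TID can't possibly be in the sorted array */
        if (item < litem || item > ritem)
            return false;

        return bsearch((void *) itemptr,
                       (void *) dead_tuples->itemptrs,
                       dead_tuples->num_tuples,
                       sizeof(ItemPointerData),
                       vac_cmp_itemptr) != NULL;
    }

A batched version could do something similar once per page, using the page's smallest and largest heap TIDs.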
--
Peter Geoghegan