Thread: [PoC] Improve dead tuple storage for lazy vacuum
Hi all,

Index vacuuming is one of the most time-consuming processes in lazy vacuuming, and lazy_tid_reaped() accounts for a large part of it. The attached flame graph shows a profile of a vacuum on a table that has one index, 80 million live rows, and 20 million dead rows, where lazy_tid_reaped() accounts for about 47% of the total vacuum execution time.

lazy_tid_reaped() is essentially an existence check: for every index tuple, it checks whether the heap TID it points to exists in the set of dead tuple TIDs. The maximum memory for dead tuple TIDs is limited by maintenance_work_mem, and if that limit is reached, the heap scan is suspended, index vacuum and heap vacuum are performed, and then the heap scan is resumed. Therefore, in terms of index vacuuming performance, there are two important factors: the speed of looking up a TID in the set of dead tuples, and the memory usage of that set. The former is obvious, whereas the latter determines the number of index vacuum passes. In many index AMs, index vacuuming (i.e., ambulkdelete) performs a full scan of the index, so it is important for performance to avoid executing index vacuuming more than once during a lazy vacuum.

Currently, the TIDs of dead tuples are stored in an array that is allocated all at once at the start of lazy vacuum, and TID lookup uses bsearch(). This has the following challenges and limitations:

1. It cannot allocate more than 1GB. There was a discussion about eliminating this limitation by using MemoryContextAllocHuge(), but there were concerns about point 2[1].

2. It allocates the whole memory space at once.

3. Lookup performance is slow (O(logN)).

I've done some experiments in this area and would like to share the results and discuss ideas.

Proposed Solutions
===============

Firstly, I've considered using existing data structures: IntegerSet (src/backend/lib/integerset.c) and TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but each addresses only one of points 2 and 3. IntegerSet uses less memory thanks to simple-8b encoding but is slow at lookup, still O(logN), since it's a tree structure. On the other hand, TIDBitmap has good lookup performance, O(1), but can unnecessarily use more memory in some cases since it always allocates enough bitmap space to cover all possible offsets. With 8kB blocks, the maximum number of line pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage), so the bitmap is 40 bytes long and we always need 46 bytes in total per block, including other meta information.

So I prototyped a new data structure dedicated to storing dead tuples during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The authors provide an implementation of Roaring Bitmap[3] (Apache 2.0 license), but I've implemented this idea from scratch because we need to integrate it with Dynamic Shared Memory/Area to support parallel vacuum, and we need to support ItemPointerData, a 6-byte value, whereas that implementation supports only 4-byte integers. Also, when it comes to vacuum, we need neither intersection, union, nor difference between sets; we need only an existence check.

The data structure is somewhat similar to TIDBitmap. It consists of a hash table and a container area: the hash table has one entry per block, and each block entry allocates memory space, called a container, in the container area to store its offset numbers. The container area is simply an array of bytes and can be enlarged as needed. In the container area, the data representation of offset numbers varies depending on their cardinality; there are three container types: array, bitmap, and run.
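To make that concrete, here is a minimal sketch in C of what a per-block entry and its container-level existence check could look like. The type and function names are illustrative assumptions for this email, not the actual PoC code:

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical container types; each block entry uses exactly one of them. */
typedef enum DtContainerType
{
    DT_CONTAINER_ARRAY,         /* sorted array of 2-byte offset numbers */
    DT_CONTAINER_BITMAP,        /* uncompressed bitmap of offset numbers */
    DT_CONTAINER_RUN            /* (start offset, run length) pairs */
} DtContainerType;

/* Hypothetical per-block hash table entry, keyed by block number. */
typedef struct DtBlockEntry
{
    BlockNumber     blkno;      /* hash key */
    DtContainerType type;       /* how the offsets are represented */
    uint32          offset;     /* byte offset into the container area */
    uint16          nwords;     /* container size, in 2-byte words */
} DtBlockEntry;

/*
 * Existence check within one container.  'data' points into the container
 * area; 'off' is the offset number we are looking for.
 */
static bool
dt_container_lookup(DtContainerType type, const uint16 *data, uint16 nwords,
                    OffsetNumber off)
{
    switch (type)
    {
        case DT_CONTAINER_BITMAP:
            /* O(1): test the bit corresponding to this offset number */
            if ((uint16) ((off - 1) / 16) >= nwords)
                return false;
            return (data[(off - 1) / 16] & (1 << ((off - 1) % 16))) != 0;

        case DT_CONTAINER_ARRAY:
            /* small sorted array: linear scan (or binary search) */
            for (int i = 0; i < nwords; i++)
                if (data[i] == off)
                    return true;
            return false;

        case DT_CONTAINER_RUN:
            /* (start, length) pairs covering consecutive dead offsets */
            for (int i = 0; i + 1 < nwords; i += 2)
                if (off >= data[i] && off < data[i] + data[i + 1])
                    return true;
            return false;
    }

    return false;               /* keep the compiler quiet */
}

In this sketch a block entry comes out at around 16 bytes after padding, which is in the same ballpark as the per-block hash table overhead mentioned below; a real implementation would also need to handle growing the container area and converting between container types as dead tuples are added.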
For example, if there are two dead tuples at offsets 1 and 150, the array container is used: it holds an array of two 2-byte integers representing 1 and 150, using 4 bytes in total. If we used the bitmap container in this case, we would need 20 bytes instead. On the other hand, if there are 20 consecutive dead tuples from offset 1 to 20, the run container is used: it holds an array of 2-byte integer pairs, where the first value in each pair represents a starting offset number and the second value represents the run length. In this case, the run container uses only 4 bytes in total. Finally, if there are dead tuples at every other offset from 1 to 100, the bitmap container is used: it holds an uncompressed bitmap, using 13 bytes. We also need another 16 bytes per block for the hash table entry. The lookup complexity of a bitmap container is O(1), whereas that of an array or run container is O(logN) or O(N); but since the number of elements in those two containers should not be large, this should not be a problem.

Evaluation
========

Before implementing this idea and integrating it with the lazy vacuum code, I implemented a benchmark tool dedicated to evaluating lazy_tid_reaped() performance[4]. It has functions for generating TIDs for both index tuples and dead tuples, loading the dead tuples into the data structure, and simulating lazy_tid_reaped() using those virtual heap tuples and dead tuples. The code lacks many features such as iteration and DSM/DSA support, but it makes testing the data structures easier.

FYI, I've confirmed the validity of this tool. When I ran a vacuum on a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took approximately 8.5 sec. Simulating a similar situation with the tool, the lookup benchmark with the array data structure took approximately 8.0 sec. Given that the tool doesn't include the cost of the function calls, it seems to simulate the real workload reasonably well.

I've evaluated the lookup performance and memory footprint of four data structures: array, integerset (intset), tidbitmap (tbm), and roaring tidbitmap (rtbm), while changing the distribution of dead tuples in blocks. Since tbm doesn't have a function for existence checks, I added one, and I allocated enough memory to make sure that tbm never becomes lossy during the evaluation. In all test cases, I simulated a table that has 1,000,000 blocks where every block has at least one dead tuple. The benchmark scenario is that for each virtual heap tuple we check whether its TID is in the dead tuple storage. Here are the results, with execution time in milliseconds and memory usage in bytes:

* Test-case 1 (10 dead tuples at intervals of 20 offsets)

An array container is selected in this test case, using 20 bytes for each block.

         Execution Time    Memory Usage
array    14,140.91         60,008,248
intset   9,350.08          50,339,840
tbm      1,299.62          100,671,544
rtbm     1,892.52          58,744,944

* Test-case 2 (10 consecutive dead tuples from offset 1)

A bitmap container is selected in this test case, using 2 bytes for each block.

         Execution Time    Memory Usage
array    1,056.60          60,008,248
intset   650.85            50,339,840
tbm      194.61            100,671,544
rtbm     154.57            27,287,664

* Test-case 3 (2 dead tuples at offsets 1 and 100)

An array container is selected in this test case, using 4 bytes for each block.
Since the 'array' data structure (not the array container of rtbm) uses only 12 bytes for each block, and rtbm additionally needs a hash table entry for each block on top of its container, the 'array' data structure uses less memory here.

         Execution Time    Memory Usage
array    6,054.22          12,008,248
intset   4,203.41          16,785,408
tbm      759.17            100,671,544
rtbm     750.08            29,384,816

* Test-case 4 (100 consecutive dead tuples from offset 1)

A run container is selected in this test case, using 4 bytes for each block.

         Execution Time    Memory Usage
array    8,883.03          600,008,248
intset   7,358.23          100,671,488
tbm      758.81            100,671,544
rtbm     764.33            29,384,816

Overall, 'rtbm' has much better lookup performance and good memory usage, especially when there are relatively many dead tuples. However, in some cases 'intset' and 'array' have better memory usage.

Feedback is very welcome. Thank you for reading the email through to the end.

Regards,

[1] https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com
[2] http://roaringbitmap.org/
[3] https://github.com/RoaringBitmap/CRoaring
[4] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi all, > > Index vacuuming is one of the most time-consuming processes in lazy > vacuuming. lazy_tid_reaped() is a large part among them. The attached > the flame graph shows a profile of a vacuum on a table that has one index > and 80 million live rows and 20 million dead rows, where > lazy_tid_reaped() accounts for about 47% of the total vacuum execution > time. > > [...] > > Overall, 'rtbm' has a much better lookup performance and good memory > usage especially if there are relatively many dead tuples. However, in > some cases, 'intset' and 'array' have a better memory usage.

Those are some great results, with a good path to meaningful improvements.

> Feedback is very welcome. Thank you for reading the email through to the end.

The currently available infrastructure for TIDs is quite ill-defined for TableAM authors [0], and other TableAMs might want to use more than just the 11 bits needed by heapam's MaxHeapTuplesPerPage at the maximum BLCKSZ to identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32kB BLCKSZ, which requires 11 bits to fit.)

Could you also check what the (performance, memory) impact would be if these proposed structures were to support the maximum MaxHeapTuplesPerPage of 1169, or the full uint16 range of offset numbers that could be supported by our current TID struct?

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/0bbeb784050503036344e1f08513f13b2083244b.camel%40j-davis.com
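For reference, MaxHeapTuplesPerPage is derived from the block size; simplified from src/include/access/htup_details.h it is essentially:

/* Simplified from src/include/access/htup_details.h */
#define MaxHeapTuplesPerPage \
    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

/*
 * 8kB blocks:  (8192 - 24) / (24 + 4)  = 291  ->  9 bits of offset number
 * 32kB blocks: (32768 - 24) / (24 + 4) = 1169 -> 11 bits of offset number
 * full uint16 offset range:              65535 -> 16 bits
 */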
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Currently, the TIDs of dead tuples are stored in an array that is > collectively allocated at the start of lazy vacuum and TID lookup uses > bsearch(). There are the following challenges and limitations: > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > this limitation by using MemoryContextAllocHuge() but there were > concerns about point 2[1]. I think that the main problem with the 1GB limitation is that it is surprising -- it can cause disruption when we first exceed the magical limit of ~174 million TIDs. This can cause us to dirty index pages a second time when we might have been able to just do it once with sufficient memory for TIDs. OTOH there are actually cases where having less memory for TIDs makes performance *better* because of locality effects. This perverse behavior with memory sizing isn't a rare case that we can safely ignore -- unfortunately it's fairly common. My point is that we should be careful to choose the correct goal. Obviously memory use matters. But it might be more helpful to think of memory use as just a proxy for what truly matters, not a goal in itself. It's hard to know what this means (what is the "real goal"?), and hard to measure it even if you know for sure. It could still be useful to think of it like this. > A run container is selected in this test case, using 4 bytes for each block. > > Execution Time Memory Usage > array 8,883.03 600,008,248 > intset 7,358.23 100,671,488 > tbm 758.81 100,671,544 > rtbm 764.33 29,384,816 > > Overall, 'rtbm' has a much better lookup performance and good memory > usage especially if there are relatively many dead tuples. However, in > some cases, 'intset' and 'array' have a better memory usage. This seems very promising. I wonder how much you have thought about the index AM side. It makes sense to initially evaluate these techniques using this approach of separating the data structure from how it is used by VACUUM -- I think that that was a good idea. But at the same time there may be certain important theoretical questions that cannot be answered this way -- questions about how everything "fits together" in a real VACUUM might matter a lot. You've probably thought about this at least a little already. Curious to hear how you think it "fits together" with the work that you've done already. The loop inside btvacuumpage() makes each loop iteration call the callback -- this is always a call to lazy_tid_reaped() in practice. And that's where we do binary searches. These binary searches are usually where we see a huge number of cycles spent when we look at profiles, including the profile that produced your flame graph. But I worry that that might be a bit misleading -- the way that profilers attribute costs is very complicated and can never be fully trusted. While it is true that lazy_tid_reaped() often accesses main memory, which will of course add a huge amount of latency and make it a huge bottleneck, the "big picture" is still relevant. I think that the compiler currently has to make very conservative assumptions when generating the machine code used by the loop inside btvacuumpage(), which calls through an opaque function pointer at least once per loop iteration -- anything can alias, so the compiler must be conservative. The data dependencies are hard for both the compiler and the CPU to analyze. 
The cost of using a function pointer compared to a direct function call is usually quite low, but there are important exceptions -- cases where it prevents other useful optimizations. Maybe this is an exception. I wonder how much it would help to break up that loop into two loops. Make the callback into a batch operation that generates state that describes what to do with each and every index tuple on the leaf page. The first loop would build a list of TIDs, then you'd call into vacuumlazy.c and get it to process the TIDs, and finally the second loop would physically delete the TIDs that need to be deleted. This would mean that there would be only one call per leaf page per btbulkdelete(). This would reduce the number of calls to the callback by at least 100x, and maybe more than 1000x. This approach would make btbulkdelete() similar to _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really an independent idea to your ideas -- I imagine that this would work far better when combined with a more compact data structure, which is naturally more capable of batch processing than a simple array of TIDs. Maybe this will help the compiler and the CPU to fully understand the *natural* data dependencies, so that they can be as effective as possible in making the code run fast. It's possible that a modern CPU will be able to *hide* the latency more intelligently than what we have today. The latency is such a big problem that we may be able to justify "wasting" other CPU resources, just because it sometimes helps with hiding the latency. For example, it might actually be okay to sort all of the TIDs on the page to make the bulk processing work -- though you might still do a precheck that is similar to the precheck inside lazy_tid_reaped() that was added by you in commit bbaf315309e. Of course it's very easy to be wrong about stuff like this. But it might not be that hard to prototype. You can literally copy and paste code from _bt_delitems_delete_check() to do this. It does the same basic thing already. -- Peter Geoghegan
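To illustrate the shape of this "one call per leaf page" idea, here is a rough sketch. The batch callback type and its bool-array interface are assumptions made up for this sketch; this is not the existing nbtree or vacuumlazy.c code:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Hypothetical batch callback: sets deadflags[i] for each dead htids[i]. */
typedef void (*BatchReapedCallback) (ItemPointer htids, int nhtids,
                                     bool *deadflags, void *state);

/*
 * Sketch of a batched leaf-page pass: one callback call per leaf page
 * instead of one per index tuple.  Posting lists, the page high key, and
 * most other btvacuumpage() details are ignored here.
 */
static void
vacuum_leaf_page_batched(Buffer buf, BatchReapedCallback callback,
                         void *callback_state)
{
    Page            page = BufferGetPage(buf);
    OffsetNumber    maxoff = PageGetMaxOffsetNumber(page);
    ItemPointerData htids[MaxIndexTuplesPerPage];
    bool            deadflags[MaxIndexTuplesPerPage];
    OffsetNumber    deletable[MaxIndexTuplesPerPage];
    int             nhtids = 0;
    int             ndeletable = 0;

    /* Loop 1: collect the heap TID of every tuple on the leaf page */
    for (OffsetNumber offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                    PageGetItemId(page, offnum));

        htids[nhtids++] = itup->t_tid;
    }

    /* One call into vacuumlazy.c for the whole page */
    callback(htids, nhtids, deadflags, callback_state);

    /* Loop 2: remember which offsets to physically delete */
    for (int i = 0; i < nhtids; i++)
    {
        if (deadflags[i])
            deletable[ndeletable++] = (OffsetNumber) (FirstOffsetNumber + i);
    }

    /* ... _bt_delitems_vacuum() would physically delete 'deletable' here ... */
}

An equally plausible interface would have the callback return the deletable entries directly, which is closer in spirit to what the _bt_simpledel_pass() + _bt_delitems_delete_check() path already does.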
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > I wonder how much it would help to break up that loop into two loops. > Make the callback into a batch operation that generates state that > describes what to do with each and every index tuple on the leaf page. > The first loop would build a list of TIDs, then you'd call into > vacuumlazy.c and get it to process the TIDs, and finally the second > loop would physically delete the TIDs that need to be deleted. This > would mean that there would be only one call per leaf page per > btbulkdelete(). This would reduce the number of calls to the callback > by at least 100x, and maybe more than 1000x.

Maybe for something like rtbm.c (which is inspired by Roaring bitmaps), you would really want to use an "intersection" operation for this. The TIDs that we need to physically delete from the leaf page inside btvacuumpage() are the intersection of two bitmaps: our bitmap of all TIDs on the leaf page, and our bitmap of all TIDs that need to be deleted by the ongoing btbulkdelete() call.

Obviously the typical case is that most TIDs in the index do *not* get deleted -- needing to delete more than ~20% of all TIDs in the index will be rare. Ideally it would be very cheap to figure out that a TID does not need to be deleted at all. Something a little like a negative cache (but not a true negative cache). This is a little bit like how hash joins can be made faster by adding a Bloom filter -- most hash probes don't need to join a tuple in the real world, and we can make these hash probes even faster by using a Bloom filter as a negative cache.

If you had the list of TIDs from a leaf page sorted for batch processing, and if you had roaring bitmap style "chunks" with "container" metadata stored in the data structure, you could then use merging/intersection -- that has some of the same advantages. I think that this would be a lot more efficient than having one binary search per TID. Most TIDs from the leaf page can be skipped over very quickly, in large groups. It's very rare for VACUUM to need to delete TIDs from completely random heap table blocks in the real world (some kind of pattern is much more common). When this merging process finds 1 TID that might really be deletable then it's probably going to find much more than 1 -- better to make that cache miss take care of all of the TIDs together.

Also seems like the CPU could do some clever prefetching with this approach -- it could prefetch TIDs where the initial chunk metadata is insufficient to eliminate them early -- these are the groups of TIDs that will have many TIDs that we actually need to delete. ISTM that improving temporal locality through batching could matter a lot here.

--
Peter Geoghegan
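A sketch of what that merge/intersection could look like against a per-block structure like rtbm follows; the store type and lookup helpers here are hypothetical placeholders, not real code from either proposal:

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical handles into the dead-tuple store described upthread. */
typedef struct DeadTupleStore DeadTupleStore;
typedef struct DtBlockEntry DtBlockEntry;

extern DtBlockEntry *dt_store_find_block(DeadTupleStore *store, BlockNumber blkno);
extern bool dt_block_entry_lookup(DtBlockEntry *entry, OffsetNumber off);

/*
 * Intersect the sorted heap TIDs collected from one leaf page with the
 * dead-tuple store.  Blocks with no dead tuples at all are rejected with
 * a single hash probe, so large groups of TIDs are skipped without ever
 * touching a container.
 */
static void
mark_deletable(ItemPointerData *htids, int nhtids, DeadTupleStore *store,
               bool *deadflags)
{
    int         i = 0;

    while (i < nhtids)
    {
        BlockNumber     blkno = ItemPointerGetBlockNumber(&htids[i]);
        DtBlockEntry   *entry = dt_store_find_block(store, blkno);

        /* process the whole run of TIDs that fall into this heap block */
        while (i < nhtids && ItemPointerGetBlockNumber(&htids[i]) == blkno)
        {
            if (entry == NULL)
                deadflags[i] = false;   /* block has no dead tuples at all */
            else
                deadflags[i] = dt_block_entry_lookup(entry,
                                        ItemPointerGetOffsetNumber(&htids[i]));
            i++;
        }
    }
}

This assumes the leaf page's heap TIDs have been sorted (and are therefore grouped by heap block) before the call, as suggested above.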
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi all, > > > > Index vacuuming is one of the most time-consuming processes in lazy > > vacuuming. lazy_tid_reaped() is a large part among them. The attached > > the flame graph shows a profile of a vacuum on a table that has one index > > and 80 million live rows and 20 million dead rows, where > > lazy_tid_reaped() accounts for about 47% of the total vacuum execution > > time. > > > > [...] > > > > Overall, 'rtbm' has a much better lookup performance and good memory > > usage especially if there are relatively many dead tuples. However, in > > some cases, 'intset' and 'array' have a better memory usage. > > Those are some great results, with a good path to meaningful improvements. > > > Feedback is very welcome. Thank you for reading the email through to the end. > > The current available infrastructure for TIDs is quite ill-defined for > TableAM authors [0], and other TableAMs might want to use more than > just the 11 bits in use by max-BLCKSZ HeapAM MaxHeapTuplesPerPage to > identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32k > BLCKSZ, which requires 11 bits to fit). > > Could you also check what the (performance, memory) impact would be if > these proposed structures were to support the maximum > MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset > numbers that could be supported by our current TID struct? I think tbm will be the most affected by the memory impact of the larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks (MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in a block, it will always require at least 147 bytes per block. Rtbm chooses the container type among array, bitmap, or run depending on the number and distribution of dead tuples in a block, and only bitmap containers can be searched with O(1). Run containers depend on the distribution of dead tuples within a block. So let’s compare array and bitmap containers. With 8kB blocks (MaxHeapTuplesPerPage = 291), 36 bytes are needed for a bitmap container at maximum. In other words, when compared to an array container, bitmap will be chosen if there are more than 18 dead tuples in a block. On the other hand, with 32kB blocks (MaxHeapTuplesPerPage = 1169), 147 bytes are needed for a bitmap container at maximum, so bitmap container will be chosen if there are more than 74 dead tuples in a block. And, with full uint16-range (MaxHeapTuplesPerPage = 65535), 8192 bytes are needed at maximum, so bitmap container will be chosen if there are more than 4096 dead tuples in a block. Therefore, in any case, if more than about 6% of tuples in a block are garbage, a bitmap container will be chosen and bring a faster lookup performance. (Of course, if a run container is chosen, the container size gets smaller but the lookup performance is O(logN).) But if the number of dead tuples in the table is small and we have the larger MaxHeapTuplesPerPage, it’s likely to choose an array container, and the lookup performance becomes O(logN). Still, it should be faster than the array data structure because the range of search targets in an array container is much smaller. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
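As a sanity check on those numbers, the break-even point between the array and bitmap containers can be written down directly; this is illustrative only, and the real container-selection logic may differ:

#include "postgres.h"
#include "access/htup_details.h"

/*
 * An array container costs 2 bytes per dead offset; a bitmap container
 * costs a fixed MaxHeapTuplesPerPage/8 bytes (rounded up).  So the bitmap
 * becomes the smaller choice once roughly MaxHeapTuplesPerPage/16 offsets
 * in a block are dead -- about 18, 74, and 4096 for the three cases
 * discussed above, i.e. roughly 6% of the possible line pointers in each
 * case, regardless of block size.
 */
static inline bool
prefer_bitmap_container(int ndead_offsets)
{
    int         array_bytes = ndead_offsets * (int) sizeof(uint16);
    int         bitmap_bytes = (MaxHeapTuplesPerPage + 7) / 8;

    return array_bytes > bitmap_bytes;
}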
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Currently, the TIDs of dead tuples are stored in an array that is > > collectively allocated at the start of lazy vacuum and TID lookup uses > > bsearch(). There are the following challenges and limitations: > > > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > this limitation by using MemoryContextAllocHuge() but there were > > concerns about point 2[1]. > > I think that the main problem with the 1GB limitation is that it is > surprising -- it can cause disruption when we first exceed the magical > limit of ~174 million TIDs. This can cause us to dirty index pages a > second time when we might have been able to just do it once with > sufficient memory for TIDs. OTOH there are actually cases where having > less memory for TIDs makes performance *better* because of locality > effects. This perverse behavior with memory sizing isn't a rare case > that we can safely ignore -- unfortunately it's fairly common. > > My point is that we should be careful to choose the correct goal. > Obviously memory use matters. But it might be more helpful to think of > memory use as just a proxy for what truly matters, not a goal in > itself. It's hard to know what this means (what is the "real goal"?), > and hard to measure it even if you know for sure. It could still be > useful to think of it like this. As I wrote in the first email, I think there are two important factors in index vacuuming performance: the performance to check if heap TID that an index tuple points to is dead, and the number of times to perform index bulk-deletion. The flame graph I attached in the first mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a disk-intensive operation in practice. Given that most index AM's bulk-deletion does a full index scan and a table could have multiple indexes, reducing the number of times to perform index bulk-deletion really contributes to reducing the execution time, especially for large tables. I think that a more compact data structure for dead tuple TIDs is one of the ways to achieve that. > > > A run container is selected in this test case, using 4 bytes for each block. > > > > Execution Time Memory Usage > > array 8,883.03 600,008,248 > > intset 7,358.23 100,671,488 > > tbm 758.81 100,671,544 > > rtbm 764.33 29,384,816 > > > > Overall, 'rtbm' has a much better lookup performance and good memory > > usage especially if there are relatively many dead tuples. However, in > > some cases, 'intset' and 'array' have a better memory usage. > > This seems very promising. > > I wonder how much you have thought about the index AM side. It makes > sense to initially evaluate these techniques using this approach of > separating the data structure from how it is used by VACUUM -- I think > that that was a good idea. But at the same time there may be certain > important theoretical questions that cannot be answered this way -- > questions about how everything "fits together" in a real VACUUM might > matter a lot. You've probably thought about this at least a little > already. Curious to hear how you think it "fits together" with the > work that you've done already. Yeah, that definitely needs to be considered. Currently, what we need for the dead tuple storage for lazy vacuum are store, lookup, and iteration. And given the parallel vacuum, it has to be able to be allocated on DSM or DSA. 
While implementing the PoC code, I'm trying to integrate it with the current lazy vacuum code. As far as I've seen so far, the integration is not hard, at least with the *current* lazy vacuum code and index AMs code. > > The loop inside btvacuumpage() makes each loop iteration call the > callback -- this is always a call to lazy_tid_reaped() in practice. > And that's where we do binary searches. These binary searches are > usually where we see a huge number of cycles spent when we look at > profiles, including the profile that produced your flame graph. But I > worry that that might be a bit misleading -- the way that profilers > attribute costs is very complicated and can never be fully trusted. > While it is true that lazy_tid_reaped() often accesses main memory, > which will of course add a huge amount of latency and make it a huge > bottleneck, the "big picture" is still relevant. > > I think that the compiler currently has to make very conservative > assumptions when generating the machine code used by the loop inside > btvacuumpage(), which calls through an opaque function pointer at > least once per loop iteration -- anything can alias, so the compiler > must be conservative. The data dependencies are hard for both the > compiler and the CPU to analyze. The cost of using a function pointer > compared to a direct function call is usually quite low, but there are > important exceptions -- cases where it prevents other useful > optimizations. Maybe this is an exception. > > I wonder how much it would help to break up that loop into two loops. > Make the callback into a batch operation that generates state that > describes what to do with each and every index tuple on the leaf page. > The first loop would build a list of TIDs, then you'd call into > vacuumlazy.c and get it to process the TIDs, and finally the second > loop would physically delete the TIDs that need to be deleted. This > would mean that there would be only one call per leaf page per > btbulkdelete(). This would reduce the number of calls to the callback > by at least 100x, and maybe more than 1000x. > > This approach would make btbulkdelete() similar to > _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really > an independent idea to your ideas -- I imagine that this would work > far better when combined with a more compact data structure, which is > naturally more capable of batch processing than a simple array of > TIDs. Maybe this will help the compiler and the CPU to fully > understand the *natural* data dependencies, so that they can be as > effective as possible in making the code run fast. It's possible that > a modern CPU will be able to *hide* the latency more intelligently > than what we have today. The latency is such a big problem that we may > be able to justify "wasting" other CPU resources, just because it > sometimes helps with hiding the latency. For example, it might > actually be okay to sort all of the TIDs on the page to make the bulk > processing work -- though you might still do a precheck that is > similar to the precheck inside lazy_tid_reaped() that was added by you > in commit bbaf315309e. Interesting idea. I remember you mentioned this idea somewhere and I've considered this idea too while implementing the PoC code. It's definitely worth trying. Maybe we can write a patch for this as a separate patch? It will change index AM and could improve also the current bulk-deletion. We can consider a better data structure on top of this idea. 
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Very nice results.

I have been working on the same problem but with a bit different solution: a mix of binary search for (sub)pages and 32-bit bitmaps for tid-in-page. Even with the current allocation heuristics (allocate 291 tids per page) it initially allocates much less space: instead of the current 291*6=1746 bytes per page it needs to allocate 80 bytes. Also it can be laid out so that it is friendly to parallel SIMD searches, doing up to 8 tid lookups in parallel.

That said, for allocating the tid array, the best solution is to postpone it as much as possible and to do the initial collection into a file, which

1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure, as you know more about the distribution at that time
3) lets you do the first heap pass in one go and then advance frozenxmin *before* index cleanup

Also, collecting dead tids into a file makes it trivial (well, almost :) ) to parallelize the initial heap scan, so more resources can be thrown at it if available.

Cheers

-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.

On Thu, Jul 8, 2021 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Currently, the TIDs of dead tuples are stored in an array that is > > > collectively allocated at the start of lazy vacuum and TID lookup uses > > > bsearch(). There are the following challenges and limitations: > > > > > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > > this limitation by using MemoryContextAllocHuge() but there were > > > concerns about point 2[1]. > > > > I think that the main problem with the 1GB limitation is that it is > > surprising -- it can cause disruption when we first exceed the magical > > limit of ~174 million TIDs. This can cause us to dirty index pages a > > second time when we might have been able to just do it once with > > sufficient memory for TIDs. OTOH there are actually cases where having > > less memory for TIDs makes performance *better* because of locality > > effects. This perverse behavior with memory sizing isn't a rare case > > that we can safely ignore -- unfortunately it's fairly common. > > > > My point is that we should be careful to choose the correct goal. > > Obviously memory use matters. But it might be more helpful to think of > > memory use as just a proxy for what truly matters, not a goal in > > itself. It's hard to know what this means (what is the "real goal"?), > > and hard to measure it even if you know for sure. It could still be > > useful to think of it like this. > > As I wrote in the first email, I think there are two important factors > in index vacuuming performance: the performance to check if heap TID > that an index tuple points to is dead, and the number of times to > perform index bulk-deletion. The flame graph I attached in the first > mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a > disk-intensive operation in practice. Given that most index AM's > bulk-deletion does a full index scan and a table could have multiple > indexes, reducing the number of times to perform index bulk-deletion > really contributes to reducing the execution time, especially for > large tables. I think that a more compact data structure for dead > tuple TIDs is one of the ways to achieve that.
> > > > > > A run container is selected in this test case, using 4 bytes for each block. > > > > > > Execution Time Memory Usage > > > array 8,883.03 600,008,248 > > > intset 7,358.23 100,671,488 > > > tbm 758.81 100,671,544 > > > rtbm 764.33 29,384,816 > > > > > > Overall, 'rtbm' has a much better lookup performance and good memory > > > usage especially if there are relatively many dead tuples. However, in > > > some cases, 'intset' and 'array' have a better memory usage. > > > > This seems very promising. > > > > I wonder how much you have thought about the index AM side. It makes > > sense to initially evaluate these techniques using this approach of > > separating the data structure from how it is used by VACUUM -- I think > > that that was a good idea. But at the same time there may be certain > > important theoretical questions that cannot be answered this way -- > > questions about how everything "fits together" in a real VACUUM might > > matter a lot. You've probably thought about this at least a little > > already. Curious to hear how you think it "fits together" with the > > work that you've done already. > > Yeah, that definitely needs to be considered. Currently, what we need > for the dead tuple storage for lazy vacuum are store, lookup, and > iteration. And given the parallel vacuum, it has to be able to be > allocated on DSM or DSA. While implementing the PoC code, I'm trying > to integrate it with the current lazy vacuum code. As far as I've seen > so far, the integration is not hard, at least with the *current* lazy > vacuum code and index AMs code. > > > > > The loop inside btvacuumpage() makes each loop iteration call the > > callback -- this is always a call to lazy_tid_reaped() in practice. > > And that's where we do binary searches. These binary searches are > > usually where we see a huge number of cycles spent when we look at > > profiles, including the profile that produced your flame graph. But I > > worry that that might be a bit misleading -- the way that profilers > > attribute costs is very complicated and can never be fully trusted. > > While it is true that lazy_tid_reaped() often accesses main memory, > > which will of course add a huge amount of latency and make it a huge > > bottleneck, the "big picture" is still relevant. > > > > I think that the compiler currently has to make very conservative > > assumptions when generating the machine code used by the loop inside > > btvacuumpage(), which calls through an opaque function pointer at > > least once per loop iteration -- anything can alias, so the compiler > > must be conservative. The data dependencies are hard for both the > > compiler and the CPU to analyze. The cost of using a function pointer > > compared to a direct function call is usually quite low, but there are > > important exceptions -- cases where it prevents other useful > > optimizations. Maybe this is an exception. > > > > I wonder how much it would help to break up that loop into two loops. > > Make the callback into a batch operation that generates state that > > describes what to do with each and every index tuple on the leaf page. > > The first loop would build a list of TIDs, then you'd call into > > vacuumlazy.c and get it to process the TIDs, and finally the second > > loop would physically delete the TIDs that need to be deleted. This > > would mean that there would be only one call per leaf page per > > btbulkdelete(). This would reduce the number of calls to the callback > > by at least 100x, and maybe more than 1000x. 
> > > > This approach would make btbulkdelete() similar to > > _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really > > an independent idea to your ideas -- I imagine that this would work > > far better when combined with a more compact data structure, which is > > naturally more capable of batch processing than a simple array of > > TIDs. Maybe this will help the compiler and the CPU to fully > > understand the *natural* data dependencies, so that they can be as > > effective as possible in making the code run fast. It's possible that > > a modern CPU will be able to *hide* the latency more intelligently > > than what we have today. The latency is such a big problem that we may > > be able to justify "wasting" other CPU resources, just because it > > sometimes helps with hiding the latency. For example, it might > > actually be okay to sort all of the TIDs on the page to make the bulk > > processing work -- though you might still do a precheck that is > > similar to the precheck inside lazy_tid_reaped() that was added by you > > in commit bbaf315309e. > > Interesting idea. I remember you mentioned this idea somewhere and > I've considered this idea too while implementing the PoC code. It's > definitely worth trying. Maybe we can write a patch for this as a > separate patch? It will change index AM and could improve also the > current bulk-deletion. We can consider a better data structure on top > of this idea. > > Regards, > > -- > Masahiko Sawada > EDB: https://www.enterprisedb.com/ > >
Resending as I forgot to send it to the list (thanks Peter :) )

On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The loop inside btvacuumpage() makes each loop iteration call the > callback -- this is always a call to lazy_tid_reaped() in practice. > And that's where we do binary searches. These binary searches are > usually where we see a huge number of cycles spent when we look at > profiles, including the profile that produced your flame graph. But I > worry that that might be a bit misleading -- the way that profilers > attribute costs is very complicated and can never be fully trusted. > While it is true that lazy_tid_reaped() often accesses main memory, > which will of course add a huge amount of latency and make it a huge > bottleneck, the "big picture" is still relevant.

This is why I have mainly focused on making it possible to use SIMD and run 4-8 binary searches in parallel, mostly 8, for AVX2. How I am approaching this is separating the "page search" to run over a (naturally) sorted array of 32-bit page pointers, and only when the page is found are the indexes in this array used to look up the in-page bitmaps. This allows the heavier bsearch activity to run on a smaller range of memory, hopefully reducing the cache thrashing.

There are opportunities to optimise this further for cache hits, by collecting the tids from indexes in larger batches and then constraining the searches in the main is-deleted bitmap to run over sections of it, but at some point this becomes a very complex balancing act, as the manipulation of the bits-to-check from indexes also takes time, not to mention the need to release the index pages and then later chase the tid pointers in case they have moved while checking them.

I have not measured anything yet, but one of my concerns is that very large dead tuple collections searched by an 8-way parallel bsearch could actually get close to saturating RAM bandwidth by reading (8 x 32 bits x cache-line-size) bytes from main memory every few cycles, so we may need some inner-loop level throttling similar to the current vacuum_cost_limit for data pages.

> I think that the compiler currently has to make very conservative > assumptions when generating the machine code used by the loop inside > btvacuumpage(), which calls through an opaque function pointer at > least once per loop iteration -- anything can alias, so the compiler > must be conservative.

Definitely this! The lookup function needs to be turned into an inline function or #define as well to give the compiler maximum freedoms.

> The data dependencies are hard for both the > compiler and the CPU to analyze. The cost of using a function pointer > compared to a direct function call is usually quite low, but there are > important exceptions -- cases where it prevents other useful > optimizations. Maybe this is an exception.

Yes. Also this could be a place where unrolling the loop could make a real difference. Maybe not unrolling the full 32 loops of a 32-bit bsearch, but something like an 8-loop unroll to get most of the benefit. The 32x unroll would not really be that bad for performance if all 32 loops were needed, but mostly we would need to jump into just the last 10 to 20 loops (for lookups over 1,000 to 1,000,000 pages), and I suspect this is such a weird corner case that the compiler is really unlikely to support this optimisation. Of course I may be wrong and it is a common enough case for the optimiser.

> > I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that > describes what to do with each and every index tuple on the leaf page. > The first loop would build a list of TIDs, then you'd call into > vacuumlazy.c and get it to process the TIDs, and finally the second > loop would physically delete the TIDs that need to be deleted. This > would mean that there would be only one call per leaf page per > btbulkdelete(). This would reduce the number of calls to the callback > by at least 100x, and maybe more than 1000x. While it may make sense to have different bitmap encodings for different distributions, it likely would not be good for optimisations if all these are used at the same time. This is why I propose the first bitmap collecting phase to collect into a file and then - when reading into memory for lookups phase - possibly rewrite the initial structure to something else if it sees that it is more efficient. Like for example where the first half of the file consists of only empty pages. > This approach would make btbulkdelete() similar to > _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really > an independent idea to your ideas -- I imagine that this would work > far better when combined with a more compact data structure, which is > naturally more capable of batch processing than a simple array of > TIDs. Maybe this will help the compiler and the CPU to fully > understand the *natural* data dependencies, so that they can be as > effective as possible in making the code run fast. It's possible that > a modern CPU will be able to *hide* the latency more intelligently > than what we have today. The latency is such a big problem that we may > be able to justify "wasting" other CPU resources, just because it > sometimes helps with hiding the latency. For example, it might > actually be okay to sort all of the TIDs on the page to make the bulk > processing work Then again it may be so much extra work that it starts to dominate some parts of profiles. For example see the work that was done in improving the mini-vacuum part where it was actually faster to copy data out to a separate buffer and then back in than shuffle it around inside the same 8k page :) So only testing will tell. > -- though you might still do a precheck that is > similar to the precheck inside lazy_tid_reaped() that was added by you > in commit bbaf315309e. > > Of course it's very easy to be wrong about stuff like this. But it > might not be that hard to prototype. You can literally copy and paste > code from _bt_delitems_delete_check() to do this. It does the same > basic thing already. Also a lot of testing would be needed to figure out which strategy fits best for which distribution of dead tuples, and possibly their relation to the order of tuples to check from indexes . Cheers -- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
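As one concrete illustration of the kind of inner loop being discussed here, below is a sketch of a branchless lower-bound search over sorted block numbers, with no function-pointer call and a fixed, data-independent trip count that is friendlier to unrolling, prefetching, or SIMD adaptation. This is a generic technique, not code from either prototype:

#include "postgres.h"

/*
 * Branchless lower bound over a sorted array of block numbers: returns the
 * index of the first element >= target.  Each step narrows the window with
 * a conditional move instead of a hard-to-predict branch, and the number
 * of iterations depends only on 'n', not on the data.
 */
static inline uint32
block_lower_bound(const uint32 *blocks, uint32 n, uint32 target)
{
    const uint32 *base = blocks;

    if (n == 0)
        return 0;

    while (n > 1)
    {
        uint32      half = n / 2;

        /* typically compiled to a cmov rather than a branch */
        base = (base[half] < target) ? base + half : base;
        n -= half;
    }

    return (uint32) (base - blocks) + (base[0] < target);
}

From here, unrolling the last few iterations by hand, or running several lookups in lockstep (the 4-8 parallel searches mentioned above), is a fairly mechanical transformation.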
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > As I wrote in the first email, I think there are two important factors > in index vacuuming performance: the performance to check if heap TID > that an index tuple points to is dead, and the number of times to > perform index bulk-deletion. The flame graph I attached in the first > mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a > disk-intensive operation in practice. Maybe. But I recently bought an NVME SSD that can read at over 6GB/second. So "disk-intensive" is not what it used to be -- at least not for reads. In general it's not good if we do multiple scans of an index -- no question. But there is a danger in paying a little too much attention to what is true in general -- we should not ignore what might be true in specific cases either. Maybe we can solve some problems by spilling the TID data structure to disk -- if we trade sequential I/O for random I/O, we may be able to do only one pass over the index (especially when we have *almost* enough memory to fit all TIDs, but not quite enough). The big problem with multiple passes over the index is not the extra read bandwidth -- it's the extra page dirtying (writes), especially with things like indexes on UUID columns. We want to dirty each leaf page in each index at most once per VACUUM, and should be willing to pay some cost in order to get a larger benefit with page dirtying. After all, writes are much more expensive on modern flash devices -- if we have to do more random read I/O to spill the TIDs then that might actually be 100% worth it. And, we don't need much memory for something that works well as a negative cache, either -- so maybe the extra random read I/O needed to spill the TIDs will be very limited anyway. There are many possibilities. You can probably think of other trade-offs yourself. We could maybe use a cost model for all this -- it is a little like a hash join IMV. This is just something to think about while refining the design. > Interesting idea. I remember you mentioned this idea somewhere and > I've considered this idea too while implementing the PoC code. It's > definitely worth trying. Maybe we can write a patch for this as a > separate patch? It will change index AM and could improve also the > current bulk-deletion. We can consider a better data structure on top > of this idea. I'm happy to write it as a separate patch, either by leaving it to you or by collaborating directly. It's not necessary to tie it to the first patch. But at the same time it is highly related to what you're already doing. As I said I am totally prepared to be wrong here. But it seems worth it to try. In Postgres 14, the _bt_delitems_vacuum() function (which actually carries out VACUUM's physical page modifications to a leaf page) is almost identical to _bt_delitems_delete(). And _bt_delitems_delete() was already built with these kinds of problems in mind -- it batches work to get the most out of synchronizing with distant state describing which tuples to delete. It's not exactly the same situation, but it's *kinda* similar. More importantly, it's a relatively cheap and easy experiment to run, since we already have most of what we need (we can take it from _bt_delitems_delete_check()). Usually this kind of micro optimization is not very valuable -- 99.9%+ of all code just isn't that sensitive to having the right optimizations. 
But this is one of the rare important cases where we really should look at the raw machine code, and do some kind of microarchitectural level analysis through careful profiling, using tools like perf. The laws of physics (or electronic engineering) make it inevitable that searching for TIDs to match is going to be kind of slow. But we should at least make sure that we use every trick available to us to reduce the bottleneck, since it really does matter a lot to users. Users should be able to expect that this code will at least be as fast as the hardware that they paid for can allow (or close to it). There is a great deal of microarchitectural sophistication with modern CPUs, much of which is designed to make problems like this one less bad [1]. [1] https://www.agner.org/optimize/microarchitecture.pdf -- Peter Geoghegan
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote: > How I am approaching this is separating "page search" tyo run over a > (naturally) sorted array of 32 bit page pointers and only when the > page is found the indexes in this array are used to look up the > in-page bitmaps. > This allows the heavier bsearch activity to run on smaller range of > memory, hopefully reducing the cache trashing. I think that the really important thing is to figure out roughly the right data structure first. > There are opportunities to optimise this further for cash hits, buy > collecting the tids from indexes in larger patches and then > constraining the searches in the main is-deleted-bitmap to run over > sections of it, but at some point this becomes a very complex > balancing act, as the manipulation of the bits-to-check from indexes > also takes time, not to mention the need to release the index pages > and then later chase the tid pointers in case they have moved while > checking them. I would say that 200 TIDs per leaf page is common and ~1350 TIDs per leaf page is not uncommon (with deduplication). Seems like that might be enough? > I have not measured anything yet, but one of my concerns in case of > very large dead tuple collections searched by 8-way parallel bsearch > could actually get close to saturating RAM bandwidth by reading (8 x > 32bits x cache-line-size) bytes from main memory every few cycles, so > we may need some inner-loop level throttling similar to current > vacuum_cost_limit for data pages. If it happens then it'll be a nice problem to have, I suppose. > Maybe not unrolling the full 32 loops for 32 bit bserach, but > something like 8-loop unroll for getting most of the benefit. My current assumption is that we're bound by memory speed right now, and that that is the big bottleneck to eliminate -- we must keep the CPU busy with data to process first. That seems like the most promising thing to focus on right now. > While it may make sense to have different bitmap encodings for > different distributions, it likely would not be good for optimisations > if all these are used at the same time. To some degree designs like Roaring bitmaps are just that -- a way of dynamically figuring out which strategy to use based on data characteristics. > This is why I propose the first bitmap collecting phase to collect > into a file and then - when reading into memory for lookups phase - > possibly rewrite the initial structure to something else if it sees > that it is more efficient. Like for example where the first half of > the file consists of only empty pages. Yeah, I agree that something like that could make sense. Although rewriting it doesn't seem particularly promising, since we can easily make it cheap to process any TID that falls into a range of blocks that have no dead tuples. We don't need to rewrite the data structure to make it do that well, AFAICT. When I said that I thought of this a little like a hash join, I was being more serious than you might imagine. Note that the number of index tuples that VACUUM will delete from each index can now be far less than the total number of TIDs stored in memory. So even when we have (say) 20% of all of the TIDs from the table in our in memory list managed by vacuumlazy.c, it's now quite possible that VACUUM will only actually "match"/"join" (i.e. delete) as few as 2% of the index tuples it finds in the index (there really is no way to predict how many). 
The opportunistic deletion stuff could easily be doing most of the required cleanup in an eager fashion following recent improvements -- VACUUM need only take care of "floating garbage" these days. In other words, thinking about this as something that is a little bit like a hash join makes sense because hash joins do very well with high join selectivity, and high join selectivity is common in the real world. The intersection of TIDs from each leaf page with the in-memory TID delete structure will often be very small indeed. > Then again it may be so much extra work that it starts to dominate > some parts of profiles. > > For example see the work that was done in improving the mini-vacuum > part where it was actually faster to copy data out to a separate > buffer and then back in than shuffle it around inside the same 8k page Some of what I'm saying is based on the experience of improving similar code used by index tuple deletion in Postgres 14. That did quite a lot of sorting of TIDs and things like that. In the end the sorting had no more than a negligible impact on performance. What really mattered was that we efficiently coordinate with distant heap pages that describe which index tuples we can delete from a given leaf page. Sorting hundreds of TIDs is cheap. Reading hundreds of random locations in memory (or even far fewer) is not so cheap. It might even be very slow indeed. Sorting in order to batch could end up looking like cheap insurance that we should be glad to pay for. > So only testing will tell. True. -- Peter Geoghegan
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote: > ... > > I would say that 200 TIDs per leaf page is common and ~1350 TIDs per > leaf page is not uncommon (with deduplication). Seems like that might > be enough? Likely yes, and also it would have the nice property of not changing the index page locking behaviour. Are deduplicated tids in the leaf page already sorted in heap order ? This could potentially simplify / speed up the sort. > > I have not measured anything yet, but one of my concerns in case of > > very large dead tuple collections searched by 8-way parallel bsearch > > could actually get close to saturating RAM bandwidth by reading (8 x > > 32bits x cache-line-size) bytes from main memory every few cycles, so > > we may need some inner-loop level throttling similar to current > > vacuum_cost_limit for data pages. > > If it happens then it'll be a nice problem to have, I suppose. > > > Maybe not unrolling the full 32 loops for 32 bit bserach, but > > something like 8-loop unroll for getting most of the benefit. > > My current assumption is that we're bound by memory speed right now, Most likely yes, and this should be also easy to check with manually unrolling perhaps 4 loops and measuring any speed increase. > and that that is the big bottleneck to eliminate -- we must keep the > CPU busy with data to process first. That seems like the most > promising thing to focus on right now. This has actually two parts - trying to make sure that we can make as much as possible from cache - if we need to get out of cache then try to parallelise this as much as possible at the same time we need to watch that we are not making the index tuple preparation work so heavy that it starts to dominate over memory access > > While it may make sense to have different bitmap encodings for > > different distributions, it likely would not be good for optimisations > > if all these are used at the same time. > > To some degree designs like Roaring bitmaps are just that -- a way of > dynamically figuring out which strategy to use based on data > characteristics. it is, but as I am keeping one eye open for vectorisation, I don't like when different parts of the same bitmap have radically different encoding strategies. > > This is why I propose the first bitmap collecting phase to collect > > into a file and then - when reading into memory for lookups phase - > > possibly rewrite the initial structure to something else if it sees > > that it is more efficient. Like for example where the first half of > > the file consists of only empty pages. > > Yeah, I agree that something like that could make sense. Although > rewriting it doesn't seem particularly promising, yeah, I hope to prove (or verify :) ) the structure is good enough so that it does not need the rewrite. > since we can easily > make it cheap to process any TID that falls into a range of blocks > that have no dead tuples. I actually meant the opposite case, where we could replace a full 80 bytes 291-bit "all dead" bitmap with just a range - int4 for page and two int2-s for min and max tid-in page for extra 10x reduction, on top of original 21x reduction from current 6 bytes / bit encoding to my page_bsearch_vector bitmaps which encodes one page to maximum of 80 bytes (5 x int4 sub-page pointers + 5 x int4 bitmaps). 
I also started out by investigating RoaringBitmaps, but when I realized that we will likely have to rewrite it anyway I continued working on getting to a single uniform encoding which fits most use cases Good Enough and then use that uniformity to enable the compiler to do its optimisation and hopefully also vectoriziation magic. > We don't need to rewrite the data structure > to make it do that well, AFAICT. > > When I said that I thought of this a little like a hash join, I was > being more serious than you might imagine. Note that the number of > index tuples that VACUUM will delete from each index can now be far > less than the total number of TIDs stored in memory. So even when we > have (say) 20% of all of the TIDs from the table in our in memory list > managed by vacuumlazy.c, it's now quite possible that VACUUM will only > actually "match"/"join" (i.e. delete) as few as 2% of the index tuples > it finds in the index (there really is no way to predict how many). > The opportunistic deletion stuff could easily be doing most of the > required cleanup in an eager fashion following recent improvements -- > VACUUM need only take care of "floating garbage" these days. Ok, this points to the need to mainly optimise for quite sparse population of dead tuples, which is still mainly clustered page-wise ? > In other > words, thinking about this as something that is a little bit like a > hash join makes sense because hash joins do very well with high join > selectivity, and high join selectivity is common in the real world. > The intersection of TIDs from each leaf page with the in-memory TID > delete structure will often be very small indeed. The hard to optimize case is still when we have dead tuple counts in hundreds of millions, or even billions, like on a HTAP database after a few hours of OLAP query have accumulated loads of dead tuples in tables getting heavy OLTP traffic. There of course we could do a totally different optimisation, where we also allow reaping tuples newer than the OLAP queries snapshot if we can prove that when the snapshot moves forward next time, it has to jump over said transactions making them indeed DEAD and not RECENTLY DEAD. Currently we let a single OLAP query ruin everything :) > > Then again it may be so much extra work that it starts to dominate > > some parts of profiles. > > > > For example see the work that was done in improving the mini-vacuum > > part where it was actually faster to copy data out to a separate > > buffer and then back in than shuffle it around inside the same 8k page > > Some of what I'm saying is based on the experience of improving > similar code used by index tuple deletion in Postgres 14. That did > quite a lot of sorting of TIDs and things like that. In the end the > sorting had no more than a negligible impact on performance. Good to know :) > What > really mattered was that we efficiently coordinate with distant heap > pages that describe which index tuples we can delete from a given leaf > page. Sorting hundreds of TIDs is cheap. Reading hundreds of random > locations in memory (or even far fewer) is not so cheap. It might even > be very slow indeed. Sorting in order to batch could end up looking > like cheap insurance that we should be glad to pay for. If the most expensive operation is sorting a few hundred of tids, then this should be fast enough. 
My worries were more that after the sorting we cannot do simple index lookups for them, but each needs to be found via bsearch, or maybe even a linear search if that is faster under some size limit, and that these could add up. Or some other needed thing also has to be done, like allocating extra memory or moving other data around in a way that the CPU does not like. Cheers ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
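To make the flow being discussed concrete, here is a minimal sketch of the per-leaf-page batch pattern: sort the TIDs collected from one leaf page, then probe the dead-TID store once per TID. The lookup callback is a placeholder for whichever structure wins; only ItemPointerCompare() and qsort() are real APIs here.

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* qsort comparator using the usual (block, offset) TID ordering */
    static int
    cmp_itemptr(const void *a, const void *b)
    {
        return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
    }

    /*
     * Sort one leaf page's TIDs for locality, then do one existence probe per
     * TID.  A real caller would carry the original array positions along so
     * the results can be mapped back to index tuples on the leaf page.
     */
    static void
    check_leaf_tids(ItemPointerData *tids, int ntids,
                    void *deadstore,
                    bool (*deadtid_exists) (void *store, ItemPointer tid),
                    bool *deletable)
    {
        qsort(tids, ntids, sizeof(ItemPointerData), cmp_itemptr);

        for (int i = 0; i < ntids; i++)
            deletable[i] = deadtid_exists(deadstore, &tids[i]);
    }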
Hi, On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > 1. Don't allocate more than 1GB. There was a discussion to eliminate > this limitation by using MemoryContextAllocHuge() but there were > concerns about point 2[1]. > > 2. Allocate the whole memory space at once. > > 3. Slow lookup performance (O(logN)). > > I’ve done some experiments in this area and would like to share the > results and discuss ideas. Yea, this is a serious issue. 3) could possibly be addressed to a decent degree without changing the fundamental datastructure too much. There's some sizable and trivial wins by just changing vac_cmp_itemptr() to compare int64s and by using an open coded bsearch(). The big problem with bsearch isn't imo the O(log(n)) complexity - it's that it has an abominally bad cache locality. And that can be addressed https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf Imo 2) isn't really that a hard problem to improve, even if we were to stay with the current bsearch approach. Reallocation with an aggressive growth factor or such isn't that bad. That's not to say we ought to stay with binary search... > Problems Solutions > =============== > > Firstly, I've considered using existing data structures: > IntegerSet(src/backend/lib/integerset.c) and > TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but > only either point 2 or 3. IntegerSet uses lower memory thanks to > simple-8b encoding but is slow at lookup, still O(logN), since it’s a > tree structure. On the other hand, TIDBitmap has a good lookup > performance, O(1), but could unnecessarily use larger memory in some > cases since it always allocates the space for bitmap enough to store > all possible offsets. With 8kB blocks, the maximum number of line > pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the > bitmap is 40 bytes long and we always need 46 bytes in total per block > including other meta information. Imo tidbitmap isn't particularly good, even in the current use cases - it's constraining in what we can store (a problem for other AMs), not actually that dense, the lossy mode doesn't choose what information to loose well etc. It'd be nice if we came up with a datastructure that could also replace the bitmap scan cases. > The data structure is somewhat similar to TIDBitmap. It consists of > the hash table and the container area; the hash table has entries per > block and each block entry allocates its memory space, called a > container, in the container area to store its offset numbers. The > container area is actually an array of bytes and can be enlarged as > needed. In the container area, the data representation of offset > numbers varies depending on their cardinality. It has three container > types: array, bitmap, and run. Not a huge fan of encoding this much knowledge about the tid layout... > For example, if there are two dead tuples at offset 1 and 150, it uses > the array container that has an array of two 2-byte integers > representing 1 and 150, using 4 bytes in total. If we used the bitmap > container in this case, we would need 20 bytes instead. On the other > hand, if there are consecutive 20 dead tuples from offset 1 to 20, it > uses the run container that has an array of 2-byte integers. The first > value in each pair represents a starting offset number, whereas the > second value represents its length. Therefore, in this case, the run > container uses only 4 bytes in total. 
> Finally, if there are dead > tuples at every other offset from 1 to 100, it uses the bitmap > container that has an uncompressed bitmap, using 13 bytes. We need > another 16 bytes per block entry for hash table entry. > > The lookup complexity of a bitmap container is O(1) whereas the one of > an array and a run container is O(N) or O(logN) but the number of > elements in those two containers should not be large it would not be a > problem. Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which many tuples have been deleted. In cases where the "run" storage is cheaper (e.g. because there's high offset numbers due to HOT pruning), we could end up regularly scanning a few hundred entries for a match. That's not cheap anymore. > Evaluation > ======== > > Before implementing this idea and integrating it with lazy vacuum > code, I've implemented a benchmark tool dedicated to evaluating > lazy_tid_reaped() performance[4]. Good idea! > > In all test cases, I simulated that the table has 1,000,000 blocks and > every block has at least one dead tuple. That doesn't strike me as a particularly common scenario? I think it's quite rare for dead tuples to be spread so evenly yet sparsely. In particular it's very common for there to be long runs of dead tuples separated by long ranges of no dead tuples at all... > The benchmark scenario is that for > each virtual heap tuple we check if there is its TID in the dead > tuple storage. Here are the results of execution time in milliseconds > and memory usage in bytes: In which order are the dead tuples checked? Looks like in sequential order? In the case of an index over a column that's not correlated with the heap order the lookups are often much more random - which can influence lookup performance drastically, due to differences in cache locality. Which will make some structures look worse/better than others. Greetings, Andres Freund
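To make the cost being questioned here concrete: under the run-container layout described earlier in the thread - an array of (start, length) pairs of 2-byte integers - a lookup is a linear scan over the runs. A sketch, with hypothetical names and the assumption that runs are kept sorted by starting offset:

    #include "postgres.h"
    #include "storage/off.h"

    /*
     * Hypothetical run container: nruns pairs of 2-byte values, each pair
     * being (starting offset, run length).
     */
    static bool
    run_container_contains(const uint16 *runs, int nruns, OffsetNumber off)
    {
        /* O(number of runs), unlike the O(1) bitmap container */
        for (int i = 0; i < nruns; i++)
        {
            uint16  start = runs[2 * i];
            uint16  len = runs[2 * i + 1];

            if (off < start)
                break;          /* runs sorted by start: no later run can match */
            if (off < start + len)
                return true;
        }
        return false;
    }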
Hi, On 2021-07-08 20:53:32 -0700, Andres Freund wrote: > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > this limitation by using MemoryContextAllocHuge() but there were > > concerns about point 2[1]. > > > > 2. Allocate the whole memory space at once. > > > > 3. Slow lookup performance (O(logN)). > > > > I’ve done some experiments in this area and would like to share the > > results and discuss ideas. > > Yea, this is a serious issue. > > > 3) could possibly be addressed to a decent degree without changing the > fundamental datastructure too much. There's some sizable and trivial > wins by just changing vac_cmp_itemptr() to compare int64s and by using > an open coded bsearch(). Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my machine. Another thing I just noticed is that you didn't include the build times for the datastructures. They are lower than the lookups currently, but it does seem like a relevant thing to measure as well. E.g. for #1 I see the following build times array 24.943 ms tbm 206.456 ms intset 93.575 ms vtbm 134.315 ms rtbm 145.964 ms that's a significant range... Randomizing the lookup order (using a random shuffle in generate_index_tuples()) changes the benchmark results for #1 significantly: shuffled time unshuffled time array 6551.726 ms 6478.554 ms intset 67590.879 ms 10815.810 ms rtbm 17992.487 ms 2518.492 ms tbm 364.917 ms 360.128 ms vtbm 12227.884 ms 1288.123 ms FWIW, I get an assertion failure when using an assertion build: #2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion", fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69 #3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242 #4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618 #5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143,maxoff=32639) at bdbench.c:587 #6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143,maxoff=32639) at bdbench.c:658 #7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873 I assume you just inverted the Assert(found) assertion? Greetings, Andres Freund
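For readers following along, a sketch of the int64 trick being measured here: encode each TID the way itemptr_encode() does - block number in the high bits, offset in the low 16 - and binary-search the encoded keys with plain integer comparisons instead of a comparator callback. This is an illustration, not the benchmark code itself.

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Same encoding as itemptr_encode(): block above, 16-bit offset below. */
    static inline int64
    encode_tid(ItemPointer tid)
    {
        return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
               ItemPointerGetOffsetNumber(tid);
    }

    /* Open-coded binary search over dead TIDs pre-encoded as a sorted int64 array. */
    static bool
    encoded_tid_exists(const int64 *dead, int ndead, ItemPointer tid)
    {
        int64   key = encode_tid(tid);
        int     lo = 0;
        int     hi = ndead;

        while (lo < hi)
        {
            int     mid = lo + (hi - lo) / 2;

            if (dead[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo < ndead && dead[lo] == key;
    }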
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > this limitation by using MemoryContextAllocHuge() but there were > > concerns about point 2[1]. > > > > 2. Allocate the whole memory space at once. > > > > 3. Slow lookup performance (O(logN)). > > > > I’ve done some experiments in this area and would like to share the > > results and discuss ideas. > > Yea, this is a serious issue. > > > 3) could possibly be addressed to a decent degree without changing the > fundamental datastructure too much. There's some sizable and trivial > wins by just changing vac_cmp_itemptr() to compare int64s and by using > an open coded bsearch(). > > The big problem with bsearch isn't imo the O(log(n)) complexity - it's > that it has an abominally bad cache locality. And that can be addressed > https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf > > Imo 2) isn't really that a hard problem to improve, even if we were to > stay with the current bsearch approach. Reallocation with an aggressive > growth factor or such isn't that bad. > > > That's not to say we ought to stay with binary search... > > > > > Problems Solutions > > =============== > > > > Firstly, I've considered using existing data structures: > > IntegerSet(src/backend/lib/integerset.c) and > > TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but > > only either point 2 or 3. IntegerSet uses lower memory thanks to > > simple-8b encoding but is slow at lookup, still O(logN), since it’s a > > tree structure. On the other hand, TIDBitmap has a good lookup > > performance, O(1), but could unnecessarily use larger memory in some > > cases since it always allocates the space for bitmap enough to store > > all possible offsets. With 8kB blocks, the maximum number of line > > pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the > > bitmap is 40 bytes long and we always need 46 bytes in total per block > > including other meta information. > > Imo tidbitmap isn't particularly good, even in the current use cases - > it's constraining in what we can store (a problem for other AMs), not > actually that dense, the lossy mode doesn't choose what information to > loose well etc. > > It'd be nice if we came up with a datastructure that could also replace > the bitmap scan cases. Agreed. > > > > The data structure is somewhat similar to TIDBitmap. It consists of > > the hash table and the container area; the hash table has entries per > > block and each block entry allocates its memory space, called a > > container, in the container area to store its offset numbers. The > > container area is actually an array of bytes and can be enlarged as > > needed. In the container area, the data representation of offset > > numbers varies depending on their cardinality. It has three container > > types: array, bitmap, and run. > > Not a huge fan of encoding this much knowledge about the tid layout... > > > > For example, if there are two dead tuples at offset 1 and 150, it uses > > the array container that has an array of two 2-byte integers > > representing 1 and 150, using 4 bytes in total. If we used the bitmap > > container in this case, we would need 20 bytes instead. On the other > > hand, if there are consecutive 20 dead tuples from offset 1 to 20, it > > uses the run container that has an array of 2-byte integers. 
The first > > value in each pair represents a starting offset number, whereas the > > second value represents its length. Therefore, in this case, the run > > container uses only 4 bytes in total. Finally, if there are dead > > tuples at every other offset from 1 to 100, it uses the bitmap > > container that has an uncompressed bitmap, using 13 bytes. We need > > another 16 bytes per block entry for hash table entry. > > > > The lookup complexity of a bitmap container is O(1) whereas the one of > > an array and a run container is O(N) or O(logN) but the number of > > elements in those two containers should not be large it would not be a > > problem. > > Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which > many tuples have been deleted. In cases where the "run" storage is > cheaper (e.g. because there's high offset numbers due to HOT pruning), > we could end up regularly scanning a few hundred entries for a > match. That's not cheap anymore. With 8kB blocks, the maximum size of a bitmap container is 37 bytes. IOW, other two types of containers are always smaller than 37 bytes. Since the run container uses 4 bytes per run, the number of runs in a run container never be more than 9. Even with 32kB blocks, we don’t have more than 37 runs. So I think N is small enough in this case. > > > > Evaluation > > ======== > > > > Before implementing this idea and integrating it with lazy vacuum > > code, I've implemented a benchmark tool dedicated to evaluating > > lazy_tid_reaped() performance[4]. > > Good idea! > > > > In all test cases, I simulated that the table has 1,000,000 blocks and > > every block has at least one dead tuple. > > That doesn't strike me as a particularly common scenario? I think it's > quite rare for there to be so evenly but sparse dead tuples. In > particularly it's very common for there to be long runs of dead tuples > separated by long ranges of no dead tuples at all... Agreed. I'll test with such scenarios. > > > > The benchmark scenario is that for > > each virtual heap tuple we check if there is its TID in the dead > > tuple storage. Here are the results of execution time in milliseconds > > and memory usage in bytes: > > In which order are the dead tuples checked? Looks like in sequential > order? In the case of an index over a column that's not correlated with > the heap order the lookups are often much more random - which can > influence lookup performance drastically, due to cache differences in > cache locality. Which will make some structures look worse/better than > others. Good point. It's sequential order, which is not good. I'll test again after shuffling virtual index tuples. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-07-08 20:53:32 -0700, Andres Freund wrote: > > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > > 1. Don't allocate more than 1GB. There was a discussion to eliminate > > > this limitation by using MemoryContextAllocHuge() but there were > > > concerns about point 2[1]. > > > > > > 2. Allocate the whole memory space at once. > > > > > > 3. Slow lookup performance (O(logN)). > > > > > > I’ve done some experiments in this area and would like to share the > > > results and discuss ideas. > > > > Yea, this is a serious issue. > > > > > > 3) could possibly be addressed to a decent degree without changing the > > fundamental datastructure too much. There's some sizable and trivial > > wins by just changing vac_cmp_itemptr() to compare int64s and by using > > an open coded bsearch(). > > Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my > machine. > > Another thing I just noticed is that you didn't include the build times for the > datastructures. They are lower than the lookups currently, but it does seem > like a relevant thing to measure as well. E.g. for #1 I see the following build > times > > array 24.943 ms > tbm 206.456 ms > intset 93.575 ms > vtbm 134.315 ms > rtbm 145.964 ms > > that's a significant range... Good point. I got similar results when measuring on my machine: array 57.987 ms tbm 297.720 ms intset 113.796 ms vtbm 165.268 ms rtbm 199.658 ms > > Randomizing the lookup order (using a random shuffle in > generate_index_tuples()) changes the benchmark results for #1 significantly: > > shuffled time unshuffled time > array 6551.726 ms 6478.554 ms > intset 67590.879 ms 10815.810 ms > rtbm 17992.487 ms 2518.492 ms > tbm 364.917 ms 360.128 ms > vtbm 12227.884 ms 1288.123 ms I believe that in your test, tbm_reaped() actually always returned true. That could explain tbm was very fast in both cases. Since TIDBitmap in the core doesn't support the existence check tbm_reaped() in bdbench.c always returns true. I added a patch in the repository to add existence check support to TIDBitmap, although it assumes bitmap never be lossy. That being said, I'm surprised that rtbm is slower than array even in the unshuffled case. I've also measured the shuffle cases and got different results. To be clear, I used prepare() SQL function to prepare both virtual dead tuples and index tuples, load them by attach_dead_tuples() SQL function, and executed bench() SQL function for each data structure. 
Here are the results: shuffled time unshuffled time array 88899.513 ms 12616.521 ms intset 73476.055 ms 10063.405 ms rtbm 22264.671 ms 2073.171 ms tbm 10285.092 ms 1417.312 ms vtbm 14488.581 ms 1240.666 ms > > FWIW, I get an assertion failure when using an assertion build: > > #2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion", > fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69 > #3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242 > #4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618 > #5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143,maxoff=32639) > at bdbench.c:587 > #6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143,maxoff=32639) > at bdbench.c:658 > #7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873 > > I assume you just inverted the Assert(found) assertion? Right. Fixed it. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I wonder how much it would help to break up that loop into two loops. > > Make the callback into a batch operation that generates state that > > describes what to do with each and every index tuple on the leaf page. > > The first loop would build a list of TIDs, then you'd call into > > vacuumlazy.c and get it to process the TIDs, and finally the second > > loop would physically delete the TIDs that need to be deleted. This > > would mean that there would be only one call per leaf page per > > btbulkdelete(). This would reduce the number of calls to the callback > > by at least 100x, and maybe more than 1000x. > > Maybe for something like rtbm.c (which is inspired by Roaring > bitmaps), you would really want to use an "intersection" operation for > this. The TIDs that we need to physically delete from the leaf page > inside btvacuumpage() are the intersection of two bitmaps: our bitmap > of all TIDs on the leaf page, and our bitmap of all TIDs that need to > be deleting by the ongoing btbulkdelete() call. Agreed. In such a batch operation, what we need to do here is to compute the intersection of two bitmaps. > > Obviously the typical case is that most TIDs in the index do *not* get > deleted -- needing to delete more than ~20% of all TIDs in the index > will be rare. Ideally it would be very cheap to figure out that a TID > does not need to be deleted at all. Something a little like a negative > cache (but not a true negative cache). This is a little bit like how > hash joins can be made faster by adding a Bloom filter -- most hash > probes don't need to join a tuple in the real world, and we can make > these hash probes even faster by using a Bloom filter as a negative > cache. Agreed. > > If you had the list of TIDs from a leaf page sorted for batch > processing, and if you had roaring bitmap style "chunks" with > "container" metadata stored in the data structure, you could then use > merging/intersection -- that has some of the same advantages. I think > that this would be a lot more efficient than having one binary search > per TID. Most TIDs from the leaf page can be skipped over very > quickly, in large groups. It's very rare for VACUUM to need to delete > TIDs from completely random heap table blocks in the real world (some > kind of pattern is much more common). > > When this merging process finds 1 TID that might really be deletable > then it's probably going to find much more than 1 -- better to make > that cache miss take care of all of the TIDs together. Also seems like > the CPU could do some clever prefetching with this approach -- it > could prefetch TIDs where the initial chunk metadata is insufficient > to eliminate them early -- these are the groups of TIDs that will have > many TIDs that we actually need to delete. ISTM that improving > temporal locality through batching could matter a lot here. That's a promising approach. In rtbm, the pair of one hash entry and one container is used per block. Therefore, we can skip TID from the leaf page by checking the hash table, if there is no dead tuple in the block. If there is the hash entry, since it means the block has at least one dead tuple, we can look for the offset of TID from the leaf page from the container. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
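A sketch of the batched check described in the last paragraph: leaf-page TIDs are assumed to be sorted, so consecutive TIDs from the same block reuse a single hash probe, and blocks with no dead tuples are skipped without touching any container. The entry layout and the container_contains() callback are hypothetical; only the dynahash calls are real APIs.

    #include "postgres.h"
    #include "storage/itemptr.h"
    #include "utils/hsearch.h"

    /* Hypothetical per-block hash entry, keyed by block number. */
    typedef struct BlockEntry
    {
        BlockNumber blkno;      /* hash key */
        void       *container;  /* array / bitmap / run data for this block */
    } BlockEntry;

    static void
    mark_deletable(HTAB *blocks, ItemPointerData *tids, int ntids,
                   bool (*container_contains) (void *container, OffsetNumber off),
                   bool *deletable)
    {
        BlockNumber last_blk = InvalidBlockNumber;
        BlockEntry *entry = NULL;

        for (int i = 0; i < ntids; i++)
        {
            BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);

            if (blk != last_blk)
            {
                entry = (BlockEntry *) hash_search(blocks, &blk, HASH_FIND, NULL);
                last_blk = blk;
            }

            if (entry == NULL)          /* block has no dead tuples at all */
                deletable[i] = false;
            else
                deletable[i] = container_contains(entry->container,
                                                  ItemPointerGetOffsetNumber(&tids[i]));
        }
    }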
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote: > > Very nice results. > > I have been working on the same problem but a bit different solution - > a mix of binary search for (sub)pages and 32-bit bitmaps for > tid-in-page. > > Even with currebnt allocation heuristics (allocate 291 tids per page) > it initially allocate much less space, instead of current 291*6=1746 > bytes per page it needs to allocate 80 bytes. > > Also it can be laid out so that it is friendly to parallel SIMD > searches doing up to 8 tid lookups in parallel. Interesting. > > That said, for allocating the tid array, the best solution is to > postpone it as much as possible and to do the initial collection into > a file, which > > 1) postpones the memory allocation to the beginning of index cleanups > > 2) lets you select the correct size and structure as you know more > about the distribution at that time > > 3) do the first heap pass in one go and then advance frozenxmin > *before* index cleanup I think we have to do index vacuuming before heap vacuuming (2nd heap pass). So do you mean that it advances relfrozenxid of pg_class before both index vacuuming and heap vacuuming? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > Currently, the TIDs of dead tuples are stored in an array that is > collectively allocated at the start of lazy vacuum and TID lookup uses > bsearch(). There are the following challenges and limitations: > So I prototyped a new data structure dedicated to storing dead tuples > during lazy vacuum while borrowing the idea from Roaring Bitmap[2]. > The authors provide an implementation of Roaring Bitmap[3] (Apache > 2.0 license). But I've implemented this idea from scratch because we > need to integrate it with Dynamic Shared Memory/Area to support > parallel vacuum and need to support ItemPointerData, 6-bytes integer > in total, whereas the implementation supports only 4-bytes integers. > Also, when it comes to vacuum, we neither need to compute the > intersection, the union, nor the difference between sets, but need > only an existence check. > > The data structure is somewhat similar to TIDBitmap. It consists of > the hash table and the container area; the hash table has entries per > block and each block entry allocates its memory space, called a > container, in the container area to store its offset numbers. The > container area is actually an array of bytes and can be enlarged as > needed. In the container area, the data representation of offset > numbers varies depending on their cardinality. It has three container > types: array, bitmap, and run. How are you thinking of implementing iteration efficiently for rtbm? The second heap pass needs that obviously... I think the only option would be to qsort the whole thing? Greetings, Andres Freund
Hi, On 2021-07-09 10:17:49 -0700, Andres Freund wrote: > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > Currently, the TIDs of dead tuples are stored in an array that is > > collectively allocated at the start of lazy vacuum and TID lookup uses > > bsearch(). There are the following challenges and limitations: > > > So I prototyped a new data structure dedicated to storing dead tuples > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2]. > > The authors provide an implementation of Roaring Bitmap[3] (Apache > > 2.0 license). But I've implemented this idea from scratch because we > > need to integrate it with Dynamic Shared Memory/Area to support > > parallel vacuum and need to support ItemPointerData, 6-bytes integer > > in total, whereas the implementation supports only 4-bytes integers. > > Also, when it comes to vacuum, we neither need to compute the > > intersection, the union, nor the difference between sets, but need > > only an existence check. > > > > The data structure is somewhat similar to TIDBitmap. It consists of > > the hash table and the container area; the hash table has entries per > > block and each block entry allocates its memory space, called a > > container, in the container area to store its offset numbers. The > > container area is actually an array of bytes and can be enlarged as > > needed. In the container area, the data representation of offset > > numbers varies depending on their cardinality. It has three container > > types: array, bitmap, and run. > > How are you thinking of implementing iteration efficiently for rtbm? The > second heap pass needs that obviously... I think the only option would > be to qsort the whole thing? I experimented further, trying to use an old radix tree implementation I had lying around to store dead tuples. With a bit of trickery that seems to work well. The radix tree implementation I have basically maps an int64 to another int64. Each level of the radix tree stores 6 bits of the key, and uses those 6 bits to index a 1<<6 (64-entry) array leading to the next level. My first idea was to use itemptr_encode() to convert tids into an int64 and store the lower 6 bits in the value part of the radix tree. That turned out to work well performance wise, but awfully memory usage wise. The problem is that we at most use 9 bits for offsets, but reserve 16 bits for it in the ItemPointerData. Which means that there's often a lot of empty "tree levels" for those 0 bits, making it hard to get to a decent memory usage. The simplest way to address that was to simply compress out those guaranteed-to-be-zero bits. That results in memory usage that's quite good - nearly always beating array, occasionally beating rtbm. It's an ordered datastructure, so the latter isn't too surprising. For lookup performance the radix approach is commonly among the best, if not the best. A variation of the storage approach is to just use the block number as the index, and store the tids as the value. Even with the absolutely naive approach of just using a Bitmapset that reduces memory usage substantially - at a small cost to search performance. Of course it'd be better to use an adaptive approach like you did for rtbm, I just thought this is good enough. This largely works well, except when there are a large number of evenly spread out dead tuples. I don't think that's a particularly common situation, but it's worth considering anyway.
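A sketch of the "block number as key, offsets as a Bitmapset" variation just described. The block-number map calls (blockmap_lookup/blockmap_store) are stand-ins, not a real API; only the Bitmapset functions are real PostgreSQL APIs.

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "storage/itemptr.h"

    /* Stand-ins for the block-number -> value map being discussed; not a real API. */
    extern void *blockmap_lookup(void *map, BlockNumber blkno);
    extern void  blockmap_store(void *map, BlockNumber blkno, void *value);

    static void
    record_dead_tid(void *map, ItemPointer tid)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(tid);
        Bitmapset  *offsets = (Bitmapset *) blockmap_lookup(map, blk);

        /* bms_add_member may repalloc, so store the (possibly new) set back */
        offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
        blockmap_store(map, blk, offsets);
    }

    static bool
    dead_tid_exists(void *map, ItemPointer tid)
    {
        Bitmapset  *offsets =
            (Bitmapset *) blockmap_lookup(map, ItemPointerGetBlockNumber(tid));

        /* bms_is_member handles a NULL set (block with no dead tuples) */
        return bms_is_member(ItemPointerGetOffsetNumber(tid), offsets);
    }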
The reason the memory usage can be larger for sparse workloads is that such workloads obviously can lead to tree nodes with only one child. As those nodes are quite large (1<<6 pointers to further children), that can then lead to a large increase in memory usage. I have toyed with implementing adaptively large radix nodes like proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't gotten it quite working. Greetings, Andres Freund
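One way to "compress out the guaranteed-to-be-zero bits" mentioned above: give the offset only the 9 key bits that MaxHeapTuplesPerPage actually needs with 8kB pages, instead of the 16 bits ItemPointerData reserves. A sketch; the constant and function names are illustrative only.

    #include "postgres.h"
    #include "access/htup_details.h"    /* MaxHeapTuplesPerPage */
    #include "storage/itemptr.h"

    /* 9 bits cover offsets up to MaxHeapTuplesPerPage (291 with 8kB pages). */
    #define DEADTID_OFFSET_BITS 9

    /*
     * Radix key with the block number packed directly above the useful offset
     * bits, so the always-zero high offset bits don't create empty tree levels.
     */
    static inline uint64
    compressed_tid_key(ItemPointer tid)
    {
        Assert(ItemPointerGetOffsetNumber(tid) < (1 << DEADTID_OFFSET_BITS));

        return ((uint64) ItemPointerGetBlockNumber(tid) << DEADTID_OFFSET_BITS) |
               ItemPointerGetOffsetNumber(tid);
    }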
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > Currently, the TIDs of dead tuples are stored in an array that is > > collectively allocated at the start of lazy vacuum and TID lookup uses > > bsearch(). There are the following challenges and limitations: > > > So I prototyped a new data structure dedicated to storing dead tuples > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2]. > > The authors provide an implementation of Roaring Bitmap[3] (Apache > > 2.0 license). But I've implemented this idea from scratch because we > > need to integrate it with Dynamic Shared Memory/Area to support > > parallel vacuum and need to support ItemPointerData, 6-bytes integer > > in total, whereas the implementation supports only 4-bytes integers. > > Also, when it comes to vacuum, we neither need to compute the > > intersection, the union, nor the difference between sets, but need > > only an existence check. > > > > The data structure is somewhat similar to TIDBitmap. It consists of > > the hash table and the container area; the hash table has entries per > > block and each block entry allocates its memory space, called a > > container, in the container area to store its offset numbers. The > > container area is actually an array of bytes and can be enlarged as > > needed. In the container area, the data representation of offset > > numbers varies depending on their cardinality. It has three container > > types: array, bitmap, and run. > > How are you thinking of implementing iteration efficiently for rtbm? The > second heap pass needs that obviously... I think the only option would > be to qsort the whole thing? Yes, I'm thinking that the iteration of rtbm is somewhat similar to tbm. That is, we iterate and collect hash table entries and do qsort hash entries by the block number. Then fetch the entry along with its container one by one in order of the block number. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
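A sketch of that iteration scheme - collect pointers to all hash entries, then qsort them by block number so the second heap pass can visit blocks in physical order; this is roughly what tbm_begin_iterate() does in tidbitmap.c. The entry type here is hypothetical.

    #include "postgres.h"
    #include "storage/block.h"
    #include "utils/hsearch.h"

    /* Hypothetical rtbm hash entry, keyed by block number. */
    typedef struct RTbmEntry
    {
        BlockNumber blkno;
        void       *container;
    } RTbmEntry;

    static int
    rtbm_entry_cmp(const void *a, const void *b)
    {
        BlockNumber ba = (*(RTbmEntry *const *) a)->blkno;
        BlockNumber bb = (*(RTbmEntry *const *) b)->blkno;

        return (ba > bb) - (ba < bb);
    }

    /* Collect all entries and sort them by block number for in-order iteration. */
    static RTbmEntry **
    rtbm_prepare_iterate(HTAB *blocks, long nentries)
    {
        RTbmEntry **sorted = palloc(nentries * sizeof(RTbmEntry *));
        HASH_SEQ_STATUS status;
        RTbmEntry  *entry;
        long        i = 0;

        hash_seq_init(&status, blocks);
        while ((entry = (RTbmEntry *) hash_seq_search(&status)) != NULL)
            sorted[i++] = entry;

        qsort(sorted, nentries, sizeof(RTbmEntry *), rtbm_entry_cmp);
        return sorted;
    }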
Sorry for the late reply. On Sat, Jul 10, 2021 at 11:55 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-07-09 10:17:49 -0700, Andres Freund wrote: > > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote: > > > Currently, the TIDs of dead tuples are stored in an array that is > > > collectively allocated at the start of lazy vacuum and TID lookup uses > > > bsearch(). There are the following challenges and limitations: > > > > > So I prototyped a new data structure dedicated to storing dead tuples > > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2]. > > > The authors provide an implementation of Roaring Bitmap[3] (Apache > > > 2.0 license). But I've implemented this idea from scratch because we > > > need to integrate it with Dynamic Shared Memory/Area to support > > > parallel vacuum and need to support ItemPointerData, 6-bytes integer > > > in total, whereas the implementation supports only 4-bytes integers. > > > Also, when it comes to vacuum, we neither need to compute the > > > intersection, the union, nor the difference between sets, but need > > > only an existence check. > > > > > > The data structure is somewhat similar to TIDBitmap. It consists of > > > the hash table and the container area; the hash table has entries per > > > block and each block entry allocates its memory space, called a > > > container, in the container area to store its offset numbers. The > > > container area is actually an array of bytes and can be enlarged as > > > needed. In the container area, the data representation of offset > > > numbers varies depending on their cardinality. It has three container > > > types: array, bitmap, and run. > > > > How are you thinking of implementing iteration efficiently for rtbm? The > > second heap pass needs that obviously... I think the only option would > > be to qsort the whole thing? > > I experimented further, trying to use an old radix tree implementation I > had lying around to store dead tuples. With a bit of trickery that seems > to work well. Thank you for experimenting with another approach. > > The radix tree implementation I have basically maps an int64 to another > int64. Each level of the radix tree stores 6 bits of the key, and uses > those 6 bits to index a 1<<6 (64-entry) array leading to the next level. > > My first idea was to use itemptr_encode() to convert tids into an int64 > and store the lower 6 bits in the value part of the radix tree. That > turned out to work well performance wise, but awfully memory usage > wise. The problem is that we at most use 9 bits for offsets, but reserve > 16 bits for it in the ItemPointerData. Which means that there's often a > lot of empty "tree levels" for those 0 bits, making it hard to get to a > decent memory usage. > > The simplest way to address that was to simply compress out those > guaranteed-to-be-zero bits. That results in memory usage that's quite > good - nearly always beating array, occasionally beating rtbm. It's an > ordered datastructure, so the latter isn't too surprising. For lookup > performance the radix approach is commonly among the best, if not the > best. How did its lookup performance and memory usage compare to intset? I guess the performance trends of those two approaches are similar since both consist of a tree. Intset encodes uint64 values with simple-8B encoding, so I'm also interested in the comparison in terms of memory usage. > > A variation of the storage approach is to just use the block number as > the index, and store the tids as the value.
> Even with the absolutely > naive approach of just using a Bitmapset that reduces memory usage > substantially - at a small cost to search performance. Of course it'd be > better to use an adaptive approach like you did for rtbm, I just thought > this is good enough. > > > This largely works well, except when there are a large number of evenly > spread out dead tuples. I don't think that's a particularly common > situation, but it's worth considering anyway. > > The reason the memory usage can be larger for sparse workloads is that > such workloads obviously can lead to tree nodes with only one child. As > those nodes are quite large (1<<6 pointers to further children), that can > then lead to a large increase in memory usage. Interesting. How big was it in such workloads compared to other data structures? I personally like adaptive approaches, especially in the context of vacuum improvements. We know common patterns of dead tuple distribution, but they don't necessarily hold, since the distribution depends on the data and on the timing of autovacuum etc. even with the same workload. And if we provide a new approach that works well in 95% of use cases but makes things worse than before in the other 5%, I don't think it is a good approach. Ideally, it should be better in common cases and at least be the same as before in other cases. BTW is the implementation of the radix tree approach available somewhere? If so I'd like to experiment with that too. > > I have toyed with implementing adaptively large radix nodes like > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't > gotten it quite working. That seems like a promising approach. Regards, [1] https://www.postgresql.org/message-id/CA%2BTgmoakKFXwUv1Cx2mspUuPQHzYF74BfJ8koF5YdgVLCvhpwA%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote: > BTW is the implementation of the radix tree approach available > somewhere? If so I'd like to experiment with that too. > > > > > I have toyed with implementing adaptively large radix nodes like > > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't > > gotten it quite working. > > That seems like a promising approach. I've since implemented some, but not all of the ideas of that paper (adaptive node sizes, but not the tree compression pieces). E.g. for

select prepare(
    1000000, -- max block
    20,      -- # of dead tuples per page
    10,      -- dead tuples interval within a page
    1        -- page interval
);

         attach     size      shuffled    ordered
array     69 ms    120 MB     84.87 s     8.66 s
intset   173 ms     65 MB     68.82 s    11.75 s
rtbm     201 ms     67 MB     11.54 s     1.35 s
tbm      232 ms    100 MB      8.33 s     1.26 s
vtbm     162 ms     58 MB     10.01 s     1.22 s
radix     88 ms     42 MB     11.49 s     1.67 s

and for

select prepare(
    1000000, -- max block
    10,      -- # of dead tuples per page
    1,       -- dead tuples interval within a page
    1        -- page interval
);

         attach     size      shuffled    ordered
array     24 ms     60MB      3.74s       1.02 s
intset    97 ms     49MB      3.14s       0.75 s
rtbm     138 ms     36MB      0.41s       0.14 s
tbm      198 ms    101MB      0.41s       0.14 s
vtbm     118 ms     27MB      0.39s       0.12 s
radix     33 ms     10MB      0.28s       0.10 s

(this is an almost unfairly good case for radix) Running out of time to format the results of the other testcases before I have to run, unfortunately. radix uses 42MB in both test case 3 and test case 4. The radix tree code isn't good right now. A ridiculous amount of duplication etc. The naming clearly shows its origins from a buffer mapping radix tree... Currently, in a bunch of the cases, 20% of the time is spent in radix_reaped(). If I move that into radix.c and allow bfm_lookup() to be inlined, I get reduced overhead. rtbm for example essentially already does that, because it does the splitting of the ItemPointer in rtbm.c. I've attached my current patches against your tree. Greetings, Andres Freund
Hi, On 2021-07-19 16:49:15 -0700, Andres Freund wrote: > E.g. for > > select prepare( > 1000000, -- max block > 20, -- # of dead tuples per page > 10, -- dead tuples interval within a page > 1 -- page inteval > ); > attach size shuffled ordered > array 69 ms 120 MB 84.87 s 8.66 s > intset 173 ms 65 MB 68.82 s 11.75 s > rtbm 201 ms 67 MB 11.54 s 1.35 s > tbm 232 ms 100 MB 8.33 s 1.26 s > vtbm 162 ms 58 MB 10.01 s 1.22 s > radix 88 ms 42 MB 11.49 s 1.67 s > > and for > select prepare( > 1000000, -- max block > 10, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1 -- page inteval > ); > > attach size shuffled ordered > array 24 ms 60MB 3.74s 1.02 s > intset 97 ms 49MB 3.14s 0.75 s > rtbm 138 ms 36MB 0.41s 0.14 s > tbm 198 ms 101MB 0.41s 0.14 s > vtbm 118 ms 27MB 0.39s 0.12 s > radix 33 ms 10MB 0.28s 0.10 s Oh, I forgot: The performance numbers are with the fixes in https://www.postgresql.org/message-id/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de applied. Greetings, Andres Freund
Hi, I've dreamed to write more compact structure for vacuum for three years, but life didn't give me a time to. Let me join to friendly competition. I've bet on HATM approach: popcount-ing bitmaps for non-empty elements. Novelties: - 32 consecutive pages are stored together in a single sparse array (called "chunks"). Chunk contains: - its number, - 4 byte bitmap of non-empty pages, - array of non-empty page headers 2 byte each. Page header contains offset of page's bitmap in bitmaps container. (Except if there is just one dead tuple in a page. Then it is written into header itself). - container of concatenated bitmaps. Ie, page metadata overhead varies from 2.4byte (32pages in single chunk) to 18byte (1 page in single chunk) per page. - If page's bitmap is sparse ie contains a lot of "all-zero" bytes, it is compressed by removing zero byte and indexing with two-level bitmap index. Two-level index - zero bytes in first level are removed using second level. It is mostly done for 32kb pages, but let it stay since it is almost free. - If page's bitmaps contains a lot of "all-one" bytes, it is inverted and then encoded as sparse. - Chunks are allocated with custom "allocator" that has no per-allocation overhead. It is possible because there is no need to perform "free": allocator is freed as whole at once. - Array of pointers to chunks is also bitmap indexed. It saves cpu time when not every 32 consecutive pages has at least one dead tuple. But consumes time otherwise. Therefore additional optimization is added to quick skip lookup for first non-empty run of chunks. (Ahhh, I believe this explanation is awful). Andres Freund wrote 2021-07-20 02:49: > Hi, > > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote: >> BTW is the implementation of the radix tree approach available >> somewhere? If so I'd like to experiment with that too. >> >> > >> > I have toyed with implementing adaptively large radix nodes like >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't >> > gotten it quite working. >> >> That seems promising approach. > > I've since implemented some, but not all of the ideas of that paper > (adaptive node sizes, but not the tree compression pieces). > > E.g. for > > select prepare( > 1000000, -- max block > 20, -- # of dead tuples per page > 10, -- dead tuples interval within a page > 1 -- page inteval > ); > attach size shuffled ordered > array 69 ms 120 MB 84.87 s 8.66 s > intset 173 ms 65 MB 68.82 s 11.75 s > rtbm 201 ms 67 MB 11.54 s 1.35 s > tbm 232 ms 100 MB 8.33 s 1.26 s > vtbm 162 ms 58 MB 10.01 s 1.22 s > radix 88 ms 42 MB 11.49 s 1.67 s > > and for > select prepare( > 1000000, -- max block > 10, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1 -- page inteval > ); > > attach size shuffled ordered > array 24 ms 60MB 3.74s 1.02 s > intset 97 ms 49MB 3.14s 0.75 s > rtbm 138 ms 36MB 0.41s 0.14 s > tbm 198 ms 101MB 0.41s 0.14 s > vtbm 118 ms 27MB 0.39s 0.12 s > radix 33 ms 10MB 0.28s 0.10 s > > (this is an almost unfairly good case for radix) > > Running out of time to format the results of the other testcases before > I have to run, unfortunately. radix uses 42MB both in test case 3 and > 4. My results (Ubuntu 20.04 Intel Core i7-1165G7): Test1. 
select prepare(1000000, 10, 20, 1); -- original

         attach     size     shuffled
array     29ms      60MB     93.99s
intset    93ms      49MB     80.94s
rtbm     171ms      67MB     14.05s
tbm      238ms     100MB      8.36s
vtbm     148ms      59MB      9.12s
radix    100ms      42MB     11.81s
svtm      75ms      29MB      8.90s

select prepare(1000000, 20, 10, 1); -- Andres's variant

         attach     size     shuffled
array     61ms     120MB    111.91s
intset   163ms      66MB     85.00s
rtbm     236ms      67MB     10.72s
tbm      290ms     100MB      8.40s
vtbm     190ms      59MB      9.28s
radix    117ms      42MB     12.00s
svtm      98ms      29MB      8.77s

Test2. select prepare(1000000, 10, 1, 1);

         attach     size     shuffled
array     31ms      60MB      4.68s
intset    97ms      49MB      4.03s
rtbm     163ms      36MB      0.42s
tbm      240ms     100MB      0.42s
vtbm     136ms      27MB      0.36s
radix     60ms      10MB      0.72s
svtm      39ms       6MB      0.19s

(The bad radix result is probably due to the smaller cache in the notebook's CPU?)

Test3. select prepare(1000000, 2, 100, 1);

         attach     size     shuffled
array      6ms      12MB     53.42s
intset    23ms      16MB     54.99s
rtbm     115ms      38MB      8.19s
tbm      186ms     100MB      8.37s
vtbm     105ms      59MB      9.08s
radix     64ms      42MB     10.41s
svtm      73ms      10MB      7.49s

Test4. select prepare(1000000, 100, 1, 1);

         attach     size     shuffled
array    304ms     600MB     75.12s
intset   775ms      98MB     47.49s
rtbm     356ms      38MB      4.11s
tbm      539ms     100MB      4.20s
vtbm     493ms      42MB      4.44s
radix    263ms      42MB      6.05s
svtm     360ms       8MB      3.49s

Therefore the Specialized Vacuum Tid Map always consumes the least memory and is usually faster. (I applied Andres's slab allocator patch before testing.) The attached patch is against commit 6753911a444e12e4b55 of your pgtools, with Andres's patches for the radix method applied. I've also pushed it to github: https://github.com/funny-falcon/pgtools/tree/svtm/bdbench regards, Yura Sokolov
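To illustrate the popcount-based indexing in the chunk layout Yura describes: the 4-byte bitmap of non-empty pages lets the per-page headers be stored densely, with a page's header found by popcounting the bits below its position. The struct and field names below are guessed from the description, not taken from the patch.

    #include "postgres.h"
    #include "port/pg_bitutils.h"
    #include "storage/block.h"

    /*
     * Hypothetical chunk covering 32 consecutive heap pages: a bitmap of
     * non-empty pages plus one 2-byte header per non-empty page.
     */
    typedef struct SvtmChunk
    {
        uint32      chunkno;    /* block number / 32 */
        uint32      pagemask;   /* bit i set => page (chunkno * 32 + i) has dead tuples */
        uint16      pagehdr[FLEXIBLE_ARRAY_MEMBER];     /* one entry per set bit */
    } SvtmChunk;

    /* Return the page header for blkno, or NULL if the page has no dead tuples. */
    static inline const uint16 *
    svtm_page_header(const SvtmChunk *chunk, BlockNumber blkno)
    {
        int         bit = blkno % 32;
        uint32      below = chunk->pagemask & (((uint32) 1 << bit) - 1);

        if ((chunk->pagemask & ((uint32) 1 << bit)) == 0)
            return NULL;

        /* headers are dense: this page's index = number of set bits below it */
        return &chunk->pagehdr[pg_popcount32(below)];
    }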
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > Hi, > > I've dreamed to write more compact structure for vacuum for three > years, but life didn't give me a time to. > > Let me join to friendly competition. > > I've bet on HATM approach: popcount-ing bitmaps for non-empty elements. Thank you for proposing the new idea! > > Novelties: > - 32 consecutive pages are stored together in a single sparse array > (called "chunks"). > Chunk contains: > - its number, > - 4 byte bitmap of non-empty pages, > - array of non-empty page headers 2 byte each. > Page header contains offset of page's bitmap in bitmaps container. > (Except if there is just one dead tuple in a page. Then it is > written into header itself). > - container of concatenated bitmaps. > > Ie, page metadata overhead varies from 2.4byte (32pages in single > chunk) > to 18byte (1 page in single chunk) per page. > > - If page's bitmap is sparse ie contains a lot of "all-zero" bytes, > it is compressed by removing zero byte and indexing with two-level > bitmap index. > Two-level index - zero bytes in first level are removed using > second level. It is mostly done for 32kb pages, but let it stay since > it is almost free. > > - If page's bitmaps contains a lot of "all-one" bytes, it is inverted > and then encoded as sparse. > > - Chunks are allocated with custom "allocator" that has no > per-allocation overhead. It is possible because there is no need > to perform "free": allocator is freed as whole at once. > > - Array of pointers to chunks is also bitmap indexed. It saves cpu time > when not every 32 consecutive pages has at least one dead tuple. > But consumes time otherwise. Therefore additional optimization is > added > to quick skip lookup for first non-empty run of chunks. > (Ahhh, I believe this explanation is awful). It sounds better than my proposal. > > Andres Freund wrote 2021-07-20 02:49: > > Hi, > > > > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote: > >> BTW is the implementation of the radix tree approach available > >> somewhere? If so I'd like to experiment with that too. > >> > >> > > >> > I have toyed with implementing adaptively large radix nodes like > >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't > >> > gotten it quite working. > >> > >> That seems promising approach. > > > > I've since implemented some, but not all of the ideas of that paper > > (adaptive node sizes, but not the tree compression pieces). > > > > E.g. for > > > > select prepare( > > 1000000, -- max block > > 20, -- # of dead tuples per page > > 10, -- dead tuples interval within a page > > 1 -- page inteval > > ); > > attach size shuffled ordered > > array 69 ms 120 MB 84.87 s 8.66 s > > intset 173 ms 65 MB 68.82 s 11.75 s > > rtbm 201 ms 67 MB 11.54 s 1.35 s > > tbm 232 ms 100 MB 8.33 s 1.26 s > > vtbm 162 ms 58 MB 10.01 s 1.22 s > > radix 88 ms 42 MB 11.49 s 1.67 s > > > > and for > > select prepare( > > 1000000, -- max block > > 10, -- # of dead tuples per page > > 1, -- dead tuples interval within a page > > 1 -- page inteval > > ); > > > > attach size shuffled ordered > > array 24 ms 60MB 3.74s 1.02 s > > intset 97 ms 49MB 3.14s 0.75 s > > rtbm 138 ms 36MB 0.41s 0.14 s > > tbm 198 ms 101MB 0.41s 0.14 s > > vtbm 118 ms 27MB 0.39s 0.12 s > > radix 33 ms 10MB 0.28s 0.10 s > > > > (this is an almost unfairly good case for radix) > > > > Running out of time to format the results of the other testcases before > > I have to run, unfortunately. 
radix uses 42MB both in test case 3 and > > 4. > > My results (Ubuntu 20.04 Intel Core i7-1165G7): > > Test1. > > select prepare(1000000, 10, 20, 1); -- original > > attach size shuffled > array 29ms 60MB 93.99s > intset 93ms 49MB 80.94s > rtbm 171ms 67MB 14.05s > tbm 238ms 100MB 8.36s > vtbm 148ms 59MB 9.12s > radix 100ms 42MB 11.81s > svtm 75ms 29MB 8.90s > > select prepare(1000000, 20, 10, 1); -- Andres's variant > > attach size shuffled > array 61ms 120MB 111.91s > intset 163ms 66MB 85.00s > rtbm 236ms 67MB 10.72s > tbm 290ms 100MB 8.40s > vtbm 190ms 59MB 9.28s > radix 117ms 42MB 12.00s > svtm 98ms 29MB 8.77s > > Test2. > > select prepare(1000000, 10, 1, 1); > > attach size shuffled > array 31ms 60MB 4.68s > intset 97ms 49MB 4.03s > rtbm 163ms 36MB 0.42s > tbm 240ms 100MB 0.42s > vtbm 136ms 27MB 0.36s > radix 60ms 10MB 0.72s > svtm 39ms 6MB 0.19s > > (Bad radix result probably due to smaller cache in notebook's CPU ?) > > Test3 > > select prepare(1000000, 2, 100, 1); > > attach size shuffled > array 6ms 12MB 53.42s > intset 23ms 16MB 54.99s > rtbm 115ms 38MB 8.19s > tbm 186ms 100MB 8.37s > vtbm 105ms 59MB 9.08s > radix 64ms 42MB 10.41s > svtm 73ms 10MB 7.49s > > Test4 > > select prepare(1000000, 100, 1, 1); > > attach size shuffled > array 304ms 600MB 75.12s > intset 775ms 98MB 47.49s > rtbm 356ms 38MB 4.11s > tbm 539ms 100MB 4.20s > vtbm 493ms 42MB 4.44s > radix 263ms 42MB 6.05s > svtm 360ms 8MB 3.49s > > Therefore Specialized Vaccum Tid Map always consumes least memory amount > and usually faster. I'll experiment with the proposed ideas including this idea in more scenarios and share the results tomorrow. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I'll experiment with the proposed ideas including this idea in more > scenarios and share the results tomorrow. > I've done some benchmarks for proposed data structures. In this trial, I've done with the scenario where dead tuples are concentrated on a particular range of table blocks (test 5-8), in addition to the scenarios I've done in the previous trial. Also, I've done benchmarks of each scenario while increasing table size. In the first test, the maximum block number of the table is 1,000,000 (i.g., 8GB table) and in the second test, it's 10,000,000 (80GB table). We can see how performance and memory consumption changes with a large-scale table. Here are the results: * Test 1 select prepare( 1000000, -- max block 10, -- # of dead tuples per page 1, -- dead tuples interval within a page 1, -- # of consecutive pages having dead tuples 20 -- page interval ); name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 | 1521.981 intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 | 997.760 radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 | 266.146 rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 | 275.143 svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 | 211.073 tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 | 128.103 * Test 2 select prepare( 1000000, -- max block 10, -- # of dead tuples per page 1, -- dead tuples interval within a page 1, -- # of consecutive pages having dead tuples 1 -- page interval ); name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 | 71.228 intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 | 49.573 radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 | 16.211 rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 | 8.693 svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 | 7.759 tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 | 6.360 * Test 3 select prepare( 1000000, -- max block 2, -- # of dead tuples per page 100, -- dead tuples interval within a page 1, -- # of consecutive pages having dead tuples 1 -- page interval ); name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 | 1045.639 intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 | 848.525 radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 | 223.413 rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 | 180.977 svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 | 212.626 tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 | 126.630 * Test 4 select prepare( 1000000, -- max block 100, -- # of dead tuples per page 1, -- dead tuples interval within a page 1, -- # of consecutive pages having dead tuples 1 -- page interval ); name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 572.21 MB | 0.367 | 78.047 | 5722.05 MB | 3.942 | 1154.776 intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 | 643.708 radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 | 133.294 rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 | 88.832 svtm | 7.28 MB | 0.294 | 3.891 | 
73.60 MB | 2.690 | 103.744 tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 | 60.632 * Test 5 select prepare( 1000000, -- max block 150, -- # of dead tuples per page 1, -- dead tuples interval within a page 10000, -- # of consecutive pages having dead tuples 20000 -- page interval ); There are 10000 consecutive pages that have 150 dead tuples at every 20000 pages. name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 | 1259.501 intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 | 517.445 radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 | 166.587 rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 | 171.725 svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 | 86.165 tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 | 151.824 * Test 6 select prepare( 1000000, -- max block 10, -- # of dead tuples per page 1, -- dead tuples interval within a page 10000, -- # of consecutive pages having dead tuples 20000 -- page interval ); There are 10000 consecutive pages that have 10 dead tuples at every 20000 pages. name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 | 46.920 intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 | 32.577 radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 | 11.060 rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 | 11.502 svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 | 4.886 tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 | 10.231 * Test 7 select prepare( 1000000, -- max block 150, -- # of dead tuples per page 1, -- dead tuples interval within a page 1000, -- # of consecutive pages having dead tuples 999000 -- page interval ); There are pages that have 150 dead tuples at first 1000 blocks and last 1000 blocks. name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 | 76.510 intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 | 52.122 radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 | 12.023 rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 | 34.528 svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 | 6.434 tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 | 38.538 * Test 8 select prepare( 1000000, -- max block 100, -- # of dead tuples per page 1, -- dead tuples interval within a page 50, -- # of consecutive pages having dead tuples 100 -- page interval ); There are 50 consecutive pages that have 100 dead tuples at every 100 pages. name | attach | attach | shuffled | size_x10 | attach_x10| shuffled_x10 --------+-----------+--------+----------+------------+-----------+------------- array | 286.11 MB | 0.184 | 67.233 | 2861.03 MB | 1.743 | 979.070 intset | 46.88 MB | 0.389 | 35.176 | 468.67 MB | 3.698 | 505.322 radix | 21.82 MB | 0.116 | 6.160 | 186.86 MB | 0.891 | 117.730 rtbm | 18.02 MB | 0.182 | 5.909 | 160.02 MB | 1.870 | 112.550 svtm | 4.28 MB | 0.152 | 3.213 | 37.60 MB | 1.383 | 79.073 tbm | 48.01 MB | 0.265 | 6.673 | 384.01 MB | 2.586 | 101.327 Overall, 'svtm' is faster and consumes less memory. 'radix' tree also has good performance and memory usage. From these results, svtm is the best data structure among proposed ideas for dead tuple storage used during lazy vacuum in terms of performance and memory usage. 
I think it can support iteration by extracting the offsets of dead tuples for each block while iterating over chunks. Apart from the performance and memory usage points of view, we also need to consider the reusability of the code. When I started this thread, I thought the best data structure would be the one optimized for vacuum's dead tuple storage. However, if we can use a data structure that is also usable in general, we can use it for other purposes as well. Moreover, if it's too optimized for the current TID system (32-bit block number, 16-bit offset number, maximum block/offset number, etc.) it may become a blocker for future changes. In that sense, a radix tree also seems good since it could also be used in GiST vacuum as a replacement for intset, or as a replacement for the shared buffer hash table as discussed before. Are there any other use cases? On the other hand, I'm concerned that a radix tree would be over-engineering in terms of vacuum's dead tuple storage, since the dead tuple storage is static data and requires only lookup operations; so if we want to use a radix tree as dead tuple storage, I'd like to see further use cases. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Masahiko Sawada писал 2021-07-27 07:06: > On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: >> >> I'll experiment with the proposed ideas including this idea in more >> scenarios and share the results tomorrow. >> > > I've done some benchmarks for proposed data structures. In this trial, > I've done with the scenario where dead tuples are concentrated on a > particular range of table blocks (test 5-8), in addition to the > scenarios I've done in the previous trial. Also, I've done benchmarks > of each scenario while increasing table size. In the first test, the > maximum block number of the table is 1,000,000 (i.g., 8GB table) and > in the second test, it's 10,000,000 (80GB table). We can see how > performance and memory consumption changes with a large-scale table. > Here are the results: > > * Test 1 > select prepare( > 1000000, -- max block > 10, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1, -- # of consecutive pages having dead tuples > 20 -- page interval > ); > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 | > 1521.981 > intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 | > 997.760 > radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 | > 266.146 > rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 | > 275.143 > svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 | > 211.073 > tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 | > 128.103 > > * Test 2 > select prepare( > 1000000, -- max block > 10, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1, -- # of consecutive pages having dead tuples > 1 -- page interval > ); > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 | > 71.228 > intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 | > 49.573 > radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 | > 16.211 > rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 | > 8.693 > svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 | > 7.759 > tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 | > 6.360 > > * Test 3 > select prepare( > 1000000, -- max block > 2, -- # of dead tuples per page > 100, -- dead tuples interval within a page > 1, -- # of consecutive pages having dead tuples > 1 -- page interval > ); > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 | > 1045.639 > intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 | > 848.525 > radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 | > 223.413 > rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 | > 180.977 > svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 | > 212.626 > tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 | > 126.630 > > * Test 4 > select prepare( > 1000000, -- max block > 100, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1, -- # of consecutive pages having dead tuples > 1 -- page interval > ); > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 572.21 MB | 0.367 | 78.047 | 5722.05 
MB | 3.942 | > 1154.776 > intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 | > 643.708 > radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 | > 133.294 > rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 | > 88.832 > svtm | 7.28 MB | 0.294 | 3.891 | 73.60 MB | 2.690 | > 103.744 > tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 | > 60.632 > > > * Test 5 > select prepare( > 1000000, -- max block > 150, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 10000, -- # of consecutive pages having dead tuples > 20000 -- page interval > ); > > There are 10000 consecutive pages that have 150 dead tuples at every > 20000 pages. > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 | > 1259.501 > intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 | > 517.445 > radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 | > 166.587 > rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 | > 171.725 > svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 | > 86.165 > tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 | > 151.824 > > * Test 6 > select prepare( > 1000000, -- max block > 10, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 10000, -- # of consecutive pages having dead tuples > 20000 -- page interval > ); > > There are 10000 consecutive pages that have 10 dead tuples at every > 20000 pages. > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 | > 46.920 > intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 | > 32.577 > radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 | > 11.060 > rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 | > 11.502 > svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 | > 4.886 > tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 | > 10.231 > > * Test 7 > select prepare( > 1000000, -- max block > 150, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 1000, -- # of consecutive pages having dead tuples > 999000 -- page interval > ); > > There are pages that have 150 dead tuples at first 1000 blocks and > last 1000 blocks. > > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 | > 76.510 > intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 | > 52.122 > radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 | > 12.023 > rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 | > 34.528 > svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 | > 6.434 > tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 | > 38.538 > > * Test 8 > select prepare( > 1000000, -- max block > 100, -- # of dead tuples per page > 1, -- dead tuples interval within a page > 50, -- # of consecutive pages having dead tuples > 100 -- page interval > ); > > There are 50 consecutive pages that have 100 dead tuples at every 100 > pages. 
> > name | attach | attach | shuffled | size_x10 | attach_x10| > shuffled_x10 > --------+-----------+--------+----------+------------+-----------+------------- > array | 286.11 MB | 0.184 | 67.233 | 2861.03 MB | 1.743 | > 979.070 > intset | 46.88 MB | 0.389 | 35.176 | 468.67 MB | 3.698 | > 505.322 > radix | 21.82 MB | 0.116 | 6.160 | 186.86 MB | 0.891 | > 117.730 > rtbm | 18.02 MB | 0.182 | 5.909 | 160.02 MB | 1.870 | > 112.550 > svtm | 4.28 MB | 0.152 | 3.213 | 37.60 MB | 1.383 | > 79.073 > tbm | 48.01 MB | 0.265 | 6.673 | 384.01 MB | 2.586 | > 101.327 > > Overall, 'svtm' is faster and consumes less memory. 'radix' tree also > has good performance and memory usage. > > From these results, svtm is the best data structure among proposed > ideas for dead tuple storage used during lazy vacuum in terms of > performance and memory usage. I think it can support iteration by > extracting the offset of dead tuples for each block while iterating > chunks. > > Apart from performance and memory usage points of view, we also need > to consider the reusability of the code. When I started this thread, I > thought the best data structure would be the one optimized for > vacuum's dead tuple storage. However, if we can use a data structure > that can also be used in general, we can use it also for other > purposes. Moreover, if it's too optimized for the current TID system > (32 bits block number, 16 bits offset number, maximum block/offset > number, etc.) it may become a blocker for future changes. > > In that sense, radix tree also seems good since it can also be used in > gist vacuum as a replacement for intset, or a replacement for hash > table for shared buffer as discussed before. Are there any other use > cases? On the other hand, I’m concerned that radix tree would be an > over-engineering in terms of vacuum's dead tuples storage since the > dead tuple storage is static data and requires only lookup operation, > so if we want to use radix tree as dead tuple storage, I'd like to see > further use cases. I can evolve svtm to transparent intset replacement certainly. Using same trick from radix_to_key it will store tids efficiently: shift = pg_ceil_log2_32(MaxHeapTuplesPerPage); tid_i = ItemPointerGetOffsetNumber(tid); tid_i |= ItemPointerGetBlockNumber(tid) << shift; Will do today's evening. regards Yura Sokolov aka funny_falcon
Hi, On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote: > I've dreamed to write more compact structure for vacuum for three > years, but life didn't give me a time to. > > Let me join to friendly competition. > > I've bet on HATM approach: popcount-ing bitmaps for non-empty elements. My concern with several of the proposals in this thread is that they over-optimize for this specific case. It's not actually that crucial to have a crazily optimized vacuum dead tid storage datatype. Having something more general that also performs reasonably for the dead tuple storage, but also performs well in a number of other cases, makes a lot more sense to me. > (Bad radix result probably due to smaller cache in notebook's CPU ?) Probably largely due to the node dispatch. a) For some reason gcc likes jump tables too much, I get better numbers when disabling those b) the node type dispatch should be stuffed into the low bits of the pointer. > select prepare(1000000, 2, 100, 1); > > attach size shuffled > array 6ms 12MB 53.42s > intset 23ms 16MB 54.99s > rtbm 115ms 38MB 8.19s > tbm 186ms 100MB 8.37s > vtbm 105ms 59MB 9.08s > radix 64ms 42MB 10.41s > svtm 73ms 10MB 7.49s > Test4 > > select prepare(1000000, 100, 1, 1); > > attach size shuffled > array 304ms 600MB 75.12s > intset 775ms 98MB 47.49s > rtbm 356ms 38MB 4.11s > tbm 539ms 100MB 4.20s > vtbm 493ms 42MB 4.44s > radix 263ms 42MB 6.05s > svtm 360ms 8MB 3.49s > > Therefore Specialized Vaccum Tid Map always consumes least memory amount > and usually faster. Impressive. Greetings, Andres Freund
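As an illustration of the pointer-tagging idea Andres mentions in (b),
here is a minimal sketch. It assumes nodes are allocated with at least
8-byte alignment, uses PostgreSQL-style integer typedefs, and all names
are made up rather than taken from any posted patch:

/* Store the node kind in the low bits of the node pointer. */
typedef uintptr_t rt_node_ptr;

#define RT_NODE_KIND_MASK	((uintptr_t) 0x07)

static inline rt_node_ptr
rt_tag_node(void *node, uint8 kind)
{
	/* node must be at least 8-byte aligned and kind must be < 8 */
	Assert(((uintptr_t) node & RT_NODE_KIND_MASK) == 0);
	return (uintptr_t) node | kind;
}

static inline uint8
rt_node_kind(rt_node_ptr ptr)
{
	return (uint8) (ptr & RT_NODE_KIND_MASK);
}

static inline void *
rt_node_raw(rt_node_ptr ptr)
{
	return (void *) (ptr & ~RT_NODE_KIND_MASK);
}

This way the dispatch can branch on the low bits of the pointer alone,
without loading the node header just to learn which search routine to
use.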
Hi, On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: > Apart from performance and memory usage points of view, we also need > to consider the reusability of the code. When I started this thread, I > thought the best data structure would be the one optimized for > vacuum's dead tuple storage. However, if we can use a data structure > that can also be used in general, we can use it also for other > purposes. Moreover, if it's too optimized for the current TID system > (32 bits block number, 16 bits offset number, maximum block/offset > number, etc.) it may become a blocker for future changes. Indeed. > In that sense, radix tree also seems good since it can also be used in > gist vacuum as a replacement for intset, or a replacement for hash > table for shared buffer as discussed before. Are there any other use > cases? Yes, I think there are. Whenever there is some spatial locality it has a decent chance of winning over a hash table, and it will most of the time win over ordered datastructures like rbtrees (which perform very poorly due to the number of branches and pointer dispatches). There's plenty hashtables, e.g. for caches, locks, etc, in PG that have a medium-high degree of locality, so I'd expect a few potential uses. When adding "tree compression" (i.e. skip inner nodes that have a single incoming & outgoing node) radix trees even can deal quite performantly with variable width keys. > On the other hand, I’m concerned that radix tree would be an > over-engineering in terms of vacuum's dead tuples storage since the > dead tuple storage is static data and requires only lookup operation, > so if we want to use radix tree as dead tuple storage, I'd like to see > further use cases. I don't think we should rely on the read-only-ness. It seems pretty clear that we'd want parallel dead-tuple scans at a point not too far into the future? Greetings, Andres Freund
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: > > Apart from performance and memory usage points of view, we also need > > to consider the reusability of the code. When I started this thread, I > > thought the best data structure would be the one optimized for > > vacuum's dead tuple storage. However, if we can use a data structure > > that can also be used in general, we can use it also for other > > purposes. Moreover, if it's too optimized for the current TID system > > (32 bits block number, 16 bits offset number, maximum block/offset > > number, etc.) it may become a blocker for future changes. > > Indeed. > > > > In that sense, radix tree also seems good since it can also be used in > > gist vacuum as a replacement for intset, or a replacement for hash > > table for shared buffer as discussed before. Are there any other use > > cases? > > Yes, I think there are. Whenever there is some spatial locality it has a > decent chance of winning over a hash table, and it will most of the time > win over ordered datastructures like rbtrees (which perform very poorly > due to the number of branches and pointer dispatches). There's plenty > hashtables, e.g. for caches, locks, etc, in PG that have a medium-high > degree of locality, so I'd expect a few potential uses. When adding > "tree compression" (i.e. skip inner nodes that have a single incoming & > outgoing node) radix trees even can deal quite performantly with > variable width keys. Good point. > > > On the other hand, I’m concerned that radix tree would be an > > over-engineering in terms of vacuum's dead tuples storage since the > > dead tuple storage is static data and requires only lookup operation, > > so if we want to use radix tree as dead tuple storage, I'd like to see > > further use cases. > > I don't think we should rely on the read-only-ness. It seems pretty > clear that we'd want parallel dead-tuple scans at a point not too far > into the future? Indeed. Given that the radix tree itself has other use cases, I have no concern about using radix tree for vacuum's dead tuples storage. It will be better to have one that can be generally used and has some optimizations that are helpful also for vacuum's use case, rather than having one that is very optimized only for vacuum's use case. During the performance benchmark, I found some bugs in the radix tree implementation. Also, we need the functionality of tree iteration, and if we have the radix tree in the source tree as a general library, we need some changes since the current implementation seems to be for a replacement for shared buffer’s hash table. I'll try to work on those stuff as PoC if you don't. What do you think? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Masahiko Sawada писал 2021-07-29 12:11: > On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> > wrote: >> >> Hi, >> >> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: >> > Apart from performance and memory usage points of view, we also need >> > to consider the reusability of the code. When I started this thread, I >> > thought the best data structure would be the one optimized for >> > vacuum's dead tuple storage. However, if we can use a data structure >> > that can also be used in general, we can use it also for other >> > purposes. Moreover, if it's too optimized for the current TID system >> > (32 bits block number, 16 bits offset number, maximum block/offset >> > number, etc.) it may become a blocker for future changes. >> >> Indeed. >> >> >> > In that sense, radix tree also seems good since it can also be used in >> > gist vacuum as a replacement for intset, or a replacement for hash >> > table for shared buffer as discussed before. Are there any other use >> > cases? >> >> Yes, I think there are. Whenever there is some spatial locality it has >> a >> decent chance of winning over a hash table, and it will most of the >> time >> win over ordered datastructures like rbtrees (which perform very >> poorly >> due to the number of branches and pointer dispatches). There's plenty >> hashtables, e.g. for caches, locks, etc, in PG that have a medium-high >> degree of locality, so I'd expect a few potential uses. When adding >> "tree compression" (i.e. skip inner nodes that have a single incoming >> & >> outgoing node) radix trees even can deal quite performantly with >> variable width keys. > > Good point. > >> >> > On the other hand, I’m concerned that radix tree would be an >> > over-engineering in terms of vacuum's dead tuples storage since the >> > dead tuple storage is static data and requires only lookup operation, >> > so if we want to use radix tree as dead tuple storage, I'd like to see >> > further use cases. >> >> I don't think we should rely on the read-only-ness. It seems pretty >> clear that we'd want parallel dead-tuple scans at a point not too far >> into the future? > > Indeed. Given that the radix tree itself has other use cases, I have > no concern about using radix tree for vacuum's dead tuples storage. It > will be better to have one that can be generally used and has some > optimizations that are helpful also for vacuum's use case, rather than > having one that is very optimized only for vacuum's use case. Main portion of svtm that leads to memory saving is compression of many pages at once (CHUNK). It could be combined with radix as a storage for pointers to CHUNKs. For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM like) and CHUNK compression, therefore datastructure could be used for gist vacuum as well. Since it is generic (allows to index all 64bit) it lacks of trick used to speedup svtm. Still on 10x test it is faster than radix. I'll send result later today after all benchmarks complete. And I'll try then to make mix of radix and CHUNK compression. > During the performance benchmark, I found some bugs in the radix tree > implementation. There is a bug in radix_to_key_off as well: tid_i |= ItemPointerGetBlockNumber(tid) << shift; ItemPointerGetBlockNumber returns uint32, therefore result after shift is uint32 as well. It leads to lesser memory consumption (and therefore better times) on 10x test, when page number exceed 2^23 (8M). It still produce "correct" result for test since every page is filled in the same way. 
Could you push your fixes for radix, please? regards, Yura Sokolov y.sokolov@postgrespro.ru funny.falcon@gmail.com
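For clarity, the overflow Yura points out comes from shifting a uint32
block number; a minimal sketch of one way to fix it (widening before
the shift) is:

	uint64		tid_i;
	int			shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);

	tid_i = ItemPointerGetOffsetNumber(tid);
	/* widen to 64 bits first, or block numbers above 2^23 lose high bits */
	tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;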
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote: > > Masahiko Sawada писал 2021-07-29 12:11: > > On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> > > wrote: > >> > >> Hi, > >> > >> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: > >> > Apart from performance and memory usage points of view, we also need > >> > to consider the reusability of the code. When I started this thread, I > >> > thought the best data structure would be the one optimized for > >> > vacuum's dead tuple storage. However, if we can use a data structure > >> > that can also be used in general, we can use it also for other > >> > purposes. Moreover, if it's too optimized for the current TID system > >> > (32 bits block number, 16 bits offset number, maximum block/offset > >> > number, etc.) it may become a blocker for future changes. > >> > >> Indeed. > >> > >> > >> > In that sense, radix tree also seems good since it can also be used in > >> > gist vacuum as a replacement for intset, or a replacement for hash > >> > table for shared buffer as discussed before. Are there any other use > >> > cases? > >> > >> Yes, I think there are. Whenever there is some spatial locality it has > >> a > >> decent chance of winning over a hash table, and it will most of the > >> time > >> win over ordered datastructures like rbtrees (which perform very > >> poorly > >> due to the number of branches and pointer dispatches). There's plenty > >> hashtables, e.g. for caches, locks, etc, in PG that have a medium-high > >> degree of locality, so I'd expect a few potential uses. When adding > >> "tree compression" (i.e. skip inner nodes that have a single incoming > >> & > >> outgoing node) radix trees even can deal quite performantly with > >> variable width keys. > > > > Good point. > > > >> > >> > On the other hand, I’m concerned that radix tree would be an > >> > over-engineering in terms of vacuum's dead tuples storage since the > >> > dead tuple storage is static data and requires only lookup operation, > >> > so if we want to use radix tree as dead tuple storage, I'd like to see > >> > further use cases. > >> > >> I don't think we should rely on the read-only-ness. It seems pretty > >> clear that we'd want parallel dead-tuple scans at a point not too far > >> into the future? > > > > Indeed. Given that the radix tree itself has other use cases, I have > > no concern about using radix tree for vacuum's dead tuples storage. It > > will be better to have one that can be generally used and has some > > optimizations that are helpful also for vacuum's use case, rather than > > having one that is very optimized only for vacuum's use case. > > Main portion of svtm that leads to memory saving is compression of many > pages at once (CHUNK). It could be combined with radix as a storage for > pointers to CHUNKs. > > For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM > like) > and CHUNK compression, therefore datastructure could be used for gist > vacuum as well. > > Since it is generic (allows to index all 64bit) it lacks of trick used > to speedup svtm. Still on 10x test it is faster than radix. BTW, how does svtm work when we add two sets of dead tuple TIDs to one svtm? Dead tuple TIDs are unique sets but those sets could have TIDs of the different offsets on the same block. The case I imagine is the idea discussed on this thread[1]. 
With this idea, we store the collected dead tuple TIDs somewhere and skip index vacuuming for some reason (index skipping optimization, failsafe mode, or interruptions etc.). Then, in the next lazy vacuum timing, we load the dead tuple TIDs and start to scan the heap. During the heap scan in the second lazy vacuum, it's possible that new dead tuples will be found on the pages that we have already stored in svtm during the first lazy vacuum. How can we efficiently update the chunk in the svtm? Regards, [1] https://www.postgresql.org/message-id/CA%2BTgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Masahiko Sawada писал 2021-07-29 17:29: > On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> > wrote: >> >> Masahiko Sawada писал 2021-07-29 12:11: >> > On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> >> > wrote: >> >> >> >> Hi, >> >> >> >> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote: >> >> > Apart from performance and memory usage points of view, we also need >> >> > to consider the reusability of the code. When I started this thread, I >> >> > thought the best data structure would be the one optimized for >> >> > vacuum's dead tuple storage. However, if we can use a data structure >> >> > that can also be used in general, we can use it also for other >> >> > purposes. Moreover, if it's too optimized for the current TID system >> >> > (32 bits block number, 16 bits offset number, maximum block/offset >> >> > number, etc.) it may become a blocker for future changes. >> >> >> >> Indeed. >> >> >> >> >> >> > In that sense, radix tree also seems good since it can also be used in >> >> > gist vacuum as a replacement for intset, or a replacement for hash >> >> > table for shared buffer as discussed before. Are there any other use >> >> > cases? >> >> >> >> Yes, I think there are. Whenever there is some spatial locality it has >> >> a >> >> decent chance of winning over a hash table, and it will most of the >> >> time >> >> win over ordered datastructures like rbtrees (which perform very >> >> poorly >> >> due to the number of branches and pointer dispatches). There's plenty >> >> hashtables, e.g. for caches, locks, etc, in PG that have a medium-high >> >> degree of locality, so I'd expect a few potential uses. When adding >> >> "tree compression" (i.e. skip inner nodes that have a single incoming >> >> & >> >> outgoing node) radix trees even can deal quite performantly with >> >> variable width keys. >> > >> > Good point. >> > >> >> >> >> > On the other hand, I’m concerned that radix tree would be an >> >> > over-engineering in terms of vacuum's dead tuples storage since the >> >> > dead tuple storage is static data and requires only lookup operation, >> >> > so if we want to use radix tree as dead tuple storage, I'd like to see >> >> > further use cases. >> >> >> >> I don't think we should rely on the read-only-ness. It seems pretty >> >> clear that we'd want parallel dead-tuple scans at a point not too far >> >> into the future? >> > >> > Indeed. Given that the radix tree itself has other use cases, I have >> > no concern about using radix tree for vacuum's dead tuples storage. It >> > will be better to have one that can be generally used and has some >> > optimizations that are helpful also for vacuum's use case, rather than >> > having one that is very optimized only for vacuum's use case. >> >> Main portion of svtm that leads to memory saving is compression of >> many >> pages at once (CHUNK). It could be combined with radix as a storage >> for >> pointers to CHUNKs., bute >> >> For a moment I'm benchmarking IntegerSet replacement based on Trie >> (HATM >> like) >> and CHUNK compression, therefore datastructure could be used for gist >> vacuum as well. >> >> Since it is generic (allows to index all 64bit) it lacks of trick used >> to speedup svtm. Still on 10x test it is faster than radix. I've attached IntegerSet2 patch for pgtools repo and benchmark results. 
Branch: https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with a couple of changes from commit
5055ef72d23482dd3e11ce in that branch: 1) compress the bitmap more
often, but more slowly, and 2) a couple of popcount tricks.

IntegerSet2 consists of a trie index over CHUNKs. A CHUNK is a
compressed bitmap of 2^15 (6+9) bits (almost like in SVTM, but for a
fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in a
couple of tests and comparable memory in others). IntegerSet2 is not
always faster than radix; it behaves more like radix. That is because
both are generic prefix trees with a comparable number of memory
accesses. SVTM did the trick by not being a multilevel prefix tree, but
just a one-level bitmap index over chunks.

I believe the trie part of IntegerSet2 could be replaced with radix,
i.e. use radix as the storage for pointers to CHUNKs.

> BTW, how does svtm work when we add two sets of dead tuple TIDs to one
> svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
> of the different offsets on the same block. The case I imagine is the
> idea discussed on this thread[1]. With this idea, we store the
> collected dead tuple TIDs somewhere and skip index vacuuming for some
> reason (index skipping optimization, failsafe mode, or interruptions
> etc.). Then, in the next lazy vacuum timing, we load the dead tuple
> TIDs and start to scan the heap. During the heap scan in the second
> lazy vacuum, it's possible that new dead tuples will be found on the
> pages that we have already stored in svtm during the first lazy
> vacuum. How can we efficiently update the chunk in the svtm?

If we store the tidmap to disk, then it will be serialized. Since
SVTM/IntegerSet2 are ordered, they can be loaded in order. Then we can
just merge tuples on a per-page basis: deserialize the page (or CHUNK),
put in the new tuples, and store it again. Since both scans (the scan
of the serialized map and the scan of the table) are in order, merging
will be cheap enough.

SVTM and IntegerSet2 already work in a "buffered" way on insertion (as
does IntegerSet, which also does compression, but in small parts).

regards,

Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachment
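A sketch of the ordered, per-block merge Yura describes above; the
tidset_* reader/writer functions are made-up placeholders standing in
for the actual SVTM/IntegerSet2 API:

/*
 * Merge the previously serialized dead-tuple TIDs with the TIDs found by
 * the current heap scan.  Both inputs are assumed to be in TID order, so
 * a single pass suffices.
 */
static void
merge_dead_tids(tidset_reader *old_set, tidset_reader *new_scan,
				tidset_writer *out)
{
	ItemPointerData a;
	ItemPointerData b;
	bool		have_a = tidset_next(old_set, &a);
	bool		have_b = tidset_next(new_scan, &b);

	while (have_a || have_b)
	{
		int			cmp;

		if (!have_a)
			cmp = 1;
		else if (!have_b)
			cmp = -1;
		else
			cmp = ItemPointerCompare(&a, &b);

		if (cmp < 0)
		{
			tidset_append(out, &a);
			have_a = tidset_next(old_set, &a);
		}
		else if (cmp > 0)
		{
			tidset_append(out, &b);
			have_b = tidset_next(new_scan, &b);
		}
		else
		{
			/* the same TID appears in both inputs; emit it once */
			tidset_append(out, &a);
			have_a = tidset_next(old_set, &a);
			have_b = tidset_next(new_scan, &b);
		}
	}
}

Because the writer sees all TIDs for a given block contiguously, it can
re-compress each page (or CHUNK) as soon as the merge moves past it.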
Yura Sokolov писал 2021-07-29 18:29: > I've attached IntegerSet2 patch for pgtools repo and benchmark results. > Branch https://github.com/funny-falcon/pgtools/tree/integerset2 Strange web-mail client... I never can be sure what it will attach... Reattach benchmark results > > regards, > > Yura Sokolov > y.sokolov@postgrespro.ru > funny.falcon@gmail.com
Attachment
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Indeed. Given that the radix tree itself has other use cases, I have > no concern about using radix tree for vacuum's dead tuples storage. It > will be better to have one that can be generally used and has some > optimizations that are helpful also for vacuum's use case, rather than > having one that is very optimized only for vacuum's use case. What I'm about to say might be a really stupid idea, especially since I haven't looked at any of the code already posted, but what I'm wondering about is whether we need a full radix tree or maybe just a radix-like lookup aid. For example, suppose that for a relation <= 8MB in size, we create an array of 1024 elements indexed by block number. Each element of the array stores an offset into the dead TID array. When you need to probe for a TID, you look up blkno and blkno + 1 in the array and then bsearch only between those two offsets. For bigger relations, a two or three level structure could be built, or it could always be 3 levels. This could even be done on demand, so you initialize all of the elements to some special value that means "not computed yet" and then fill them the first time they're needed, perhaps with another special value that means "no TIDs in that block". I don't know if this is better, but I do kind of like the fact that the basic representation is just an array. It makes it really easy to predict how much memory will be needed for a given number of dead TIDs, and it's very DSM-friendly as well. -- Robert Haas EDB: http://www.enterprisedb.com
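A minimal sketch of the lookup aid Robert describes above, assuming the
existing sorted dead-TID array plus a per-block directory with one
extra sentinel entry; the names are illustrative only:

/*
 * block_start[blk] is the index of the first dead TID belonging to block
 * blk, and block_start[nblocks] equals the total number of dead TIDs, so
 * block_start[blk + 1] bounds the slice to search.
 */
static bool
tid_is_dead(ItemPointer tid, ItemPointerData *dead_tids,
			uint32 *block_start, BlockNumber nblocks)
{
	BlockNumber blk = ItemPointerGetBlockNumber(tid);
	uint32		lo = block_start[blk];
	uint32		hi = block_start[blk + 1];

	Assert(blk < nblocks);

	/* bsearch only within this block's slice of the dead-TID array */
	while (lo < hi)
	{
		uint32		mid = lo + (hi - lo) / 2;
		int			cmp = ItemPointerCompare(&dead_tids[mid], tid);

		if (cmp == 0)
			return true;
		else if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}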
Robert Haas wrote on 2021-07-29 20:15:
> On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>> Indeed. Given that the radix tree itself has other use cases, I have
>> no concern about using radix tree for vacuum's dead tuples storage. It
>> will be better to have one that can be generally used and has some
>> optimizations that are helpful also for vacuum's use case, rather than
>> having one that is very optimized only for vacuum's use case.
>
> What I'm about to say might be a really stupid idea, especially since
> I haven't looked at any of the code already posted, but what I'm
> wondering about is whether we need a full radix tree or maybe just a
> radix-like lookup aid. For example, suppose that for a relation <= 8MB
> in size, we create an array of 1024 elements indexed by block number.
> Each element of the array stores an offset into the dead TID array.
> When you need to probe for a TID, you look up blkno and blkno + 1 in
> the array and then bsearch only between those two offsets. For bigger
> relations, a two or three level structure could be built, or it could
> always be 3 levels. This could even be done on demand, so you
> initialize all of the elements to some special value that means "not
> computed yet" and then fill them the first time they're needed,
> perhaps with another special value that means "no TIDs in that block".

An 8MB relation is not a problem, IMO. There is no need to do anything
special to handle an 8MB relation. The problem is a 2TB relation. It
has 256M pages and, let's suppose, 3G dead tuples. Then the offset
array will be 2GB and the tuple offset array will be 6GB (2-byte offset
per tuple), 8GB in total.

We can make the offset array cover only the higher 3 bytes of the block
number. We then will have a 1M-entry offset array weighing 8MB, plus an
array of 3-byte tuple pointers (1 remaining byte from the block number,
and 2 bytes from the tuple offset) weighing 9GB.

But using per-batch compression schemes, that could be amortized to 4
bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory. Yes, it
is not as guaranteed as in the array approach, but 95% of the time it
is that low or even lower. And better: the more tuples are dead, the
better the compression works. A page with all tuples dead could be
encoded in as little as 5 bytes. Therefore, overall memory consumption
is more stable and predictable.

Lower memory consumption of the tuple storage means there is less
chance that indexes have to be scanned twice or more. That gives more
predictability in the user experience.

> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

The whole thing could be encoded in one single array of bytes. Just
give "pointer-to-array" + "array-size" to the constructor, and use a
"bump allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness", at least if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the
"per-allocation overhead" of "aset", "slab" or "generational". And the
IntegerSet2 version even uses it for all allocations since it has no
reallocatable parts. Well, if a data structure has reallocatable parts,
it could be less friendly to DSM.

regards,

---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
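A minimal sketch of the "bump allocator over one flat byte array" idea
above; it is purely illustrative and not the actual SVTM code:

typedef struct bump_arena
{
	char	   *base;			/* caller-provided buffer (could be in DSM) */
	Size		size;			/* total buffer size in bytes */
	Size		used;			/* bytes handed out so far */
} bump_arena;

static void
bump_init(bump_arena *a, void *buf, Size size)
{
	a->base = (char *) buf;
	a->size = size;
	a->used = 0;
}

static void *
bump_alloc(bump_arena *a, Size len)
{
	void	   *p;

	len = MAXALIGN(len);		/* keep every allocation aligned */
	if (a->used + len > a->size)
		return NULL;			/* caller must grow the buffer or flush */

	p = a->base + a->used;
	a->used += len;
	return p;
}

For a DSM-resident structure, internal references would be stored as
offsets from 'base' rather than raw pointers, which is what makes this
layout easy to map at different addresses in different processes.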
Hi, On 2021-07-29 13:15:53 -0400, Robert Haas wrote: > I don't know if this is better, but I do kind of like the fact that > the basic representation is just an array. It makes it really easy to > predict how much memory will be needed for a given number of dead > TIDs, and it's very DSM-friendly as well. I think those advantages are far outstripped by the big disadvantage of needing to either size the array accurately from the start, or to reallocate the whole array. Our current pre-allocation behaviour is very wasteful for most vacuums but doesn't handle large work_mem at all, causing unnecessary index scans. Greetings, Andres Freund
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote: > I think those advantages are far outstripped by the big disadvantage of > needing to either size the array accurately from the start, or to > reallocate the whole array. Our current pre-allocation behaviour is > very wasteful for most vacuums but doesn't handle large work_mem at all, > causing unnecessary index scans. I agree that the current pre-allocation behavior is bad, but I don't really see that as an issue with my idea. Fixing that would require allocating the array in chunks, but that doesn't really affect the core of the idea much, at least as I see it. But I accept that Yura has a very good point about the memory usage of what I was proposing. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-07-30 15:13:49 -0400, Robert Haas wrote: > On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote: > > I think those advantages are far outstripped by the big disadvantage of > > needing to either size the array accurately from the start, or to > > reallocate the whole array. Our current pre-allocation behaviour is > > very wasteful for most vacuums but doesn't handle large work_mem at all, > > causing unnecessary index scans. > > I agree that the current pre-allocation behavior is bad, but I don't > really see that as an issue with my idea. Fixing that would require > allocating the array in chunks, but that doesn't really affect the > core of the idea much, at least as I see it. Well, then it'd not really be the "simple array approach" anymore :) > But I accept that Yura has a very good point about the memory usage of > what I was proposing. The lower memory usage also often will result in a better cache utilization - which is a crucial factor for index vacuuming when the index order isn't correlated with the heap order. Cache misses really are a crucial performance factor there. Greetings, Andres Freund
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund <andres@anarazel.de> wrote: > The lower memory usage also often will result in a better cache > utilization - which is a crucial factor for index vacuuming when the > index order isn't correlated with the heap order. Cache misses really > are a crucial performance factor there. Fair enough. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, Today I noticed the inefficiencies of our dead tuple storage once again, and started theorizing about a better storage method; which is when I remembered that this thread exists, and that this thread already has amazing results. Are there any plans to get the results of this thread from PoC to committable? Kind regards, Matthias van de Meent
Hi, On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote: > Today I noticed the inefficiencies of our dead tuple storage once > again, and started theorizing about a better storage method; which is > when I remembered that this thread exists, and that this thread > already has amazing results. > > Are there any plans to get the results of this thread from PoC to committable? I'm not currently planning to work on it personally. It'd would be awesome if somebody did... Greetings, Andres Freund
On Sun, Feb 13, 2022 at 11:02 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote: > > Today I noticed the inefficiencies of our dead tuple storage once > > again, and started theorizing about a better storage method; which is > > when I remembered that this thread exists, and that this thread > > already has amazing results. > > > > Are there any plans to get the results of this thread from PoC to committable? > > I'm not currently planning to work on it personally. It'd would be awesome if > somebody did... Actually, I'm working on simplifying and improving radix tree implementation for PG16 dev cycle. From the discussion so far I think it's better to have a data structure that can be used for general-purpose and is also good for storing TID, not very specific to store TID. So I think radix tree would be a potent candidate. I have done the insertion and search implementation. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote: > Actually, I'm working on simplifying and improving radix tree > implementation for PG16 dev cycle. From the discussion so far I think > it's better to have a data structure that can be used for > general-purpose and is also good for storing TID, not very specific to > store TID. So I think radix tree would be a potent candidate. I have > done the insertion and search implementation. Awesome!
Hi,

On Sun, Feb 13, 2022 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
> > Actually, I'm working on simplifying and improving radix tree
> > implementation for PG16 dev cycle. From the discussion so far I think
> > it's better to have a data structure that can be used for
> > general-purpose and is also good for storing TID, not very specific to
> > store TID. So I think radix tree would be a potent candidate. I have
> > done the insertion and search implementation.
>
> Awesome!

To move this project forward, I've implemented a radix tree from
scratch while studying Andres's implementation. It supports insertion,
search, and iteration, but not deletion yet. In my implementation, I
use Datum as the value so internal and leaf nodes have the same data
structure, simplifying the implementation. Iteration on the radix tree
returns keys with their values in ascending key order. The patch has
regression tests for the radix tree but is still in a PoC state: many
debugging codes are left, SSE2 SIMD instructions are not supported, and
the added -mavx2 flag is hard-coded.

I've measured the size and the loading and lookup performance of each
candidate data structure with two test cases, dense and sparse, by
using the test tool[1]. Here are the results:

* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)

select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

 name        size     attach       lookup
 array       520 MB   248.60 ms    89891.92 ms
 hash        3188 MB  28029.59 ms  50850.32 ms
 intset      85 MB    644.96 ms    39801.17 ms
 tbm         96 MB    474.06 ms    6641.38 ms
 radix       37 MB    173.03 ms    9145.97 ms
 radix_tree  36 MB    184.51 ms    9729.94 ms

* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)

select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

 name        size     attach   lookup
 array       125 kB   0.53 ms  82183.61 ms
 hash        1032 kB  1.31 ms  28128.33 ms
 intset      222 kB   0.51 ms  87775.68 ms
 tbm         768 MB   1.24 ms  98674.60 ms
 radix       1080 kB  1.66 ms  20698.07 ms
 radix_tree  949 kB   1.50 ms  21465.23 ms

Each test virtually generates TIDs and loads them into the data
structure, and then searches for virtual index TIDs. 'array' is a
sorted array, which is the current method, 'hash' is HTAB, 'intset' is
IntegerSet, and 'tbm' is TIDBitmap. The last two results are radix tree
implementations: 'radix' is Andres's radix tree implementation and
'radix_tree' is my radix tree implementation. In both radix tree tests,
I convert TIDs into an int64 and store the lower 6 bits in the value
part of the radix tree.

Overall, the radix tree implementations have good numbers. Once we get
an agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
[2] https://www.postgresql.org/message-id/CAFiTN-visUO9VTz2%2Bh224z5QeUjKhKNdSfjaCucPhYJdbzxx0g%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
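A sketch of the TID-to-key encoding described in the message above; the
exact scheme in the attached patches may differ, and the function name
here is illustrative:

/*
 * Pack a TID into a 64-bit integer (offset number in the low bits, block
 * number above it).  The low 6 bits select a bit within the 64-bit value
 * stored in the radix tree, so one tree entry covers 64 encoded TIDs.
 */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
	uint64		tid_i;
	int			shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);

	tid_i = ItemPointerGetOffsetNumber(tid);
	tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;

	*off = tid_i & 63;			/* bit position within the value word */
	return tid_i >> 6;			/* radix tree key */
}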
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Overall, radix tree implementations have good numbers. Once we got an > agreement on moving in this direction, I'll start a new thread for > that and move the implementation further; there are many things to do > and discuss: deletion, API design, SIMD support, more tests etc. +1 (FWIW, I think the current thread is still fine.) -- John Naylor EDB: http://www.enterprisedb.com
On Tue, May 10, 2022 at 6:58 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Overall, radix tree implementations have good numbers. Once we got an > > agreement on moving in this direction, I'll start a new thread for > > that and move the implementation further; there are many things to do > > and discuss: deletion, API design, SIMD support, more tests etc. > > +1 > Thanks! I've attached an updated version patch. It is still WIP but I've implemented deletion and improved test cases and comments. > (FWIW, I think the current thread is still fine.) Okay, agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, May 10, 2022 at 6:58 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Overall, radix tree implementations have good numbers. Once we got an > > > agreement on moving in this direction, I'll start a new thread for > > > that and move the implementation further; there are many things to do > > > and discuss: deletion, API design, SIMD support, more tests etc. > > > > +1 > > > > Thanks! > > I've attached an updated version patch. It is still WIP but I've > implemented deletion and improved test cases and comments. I've attached an updated version patch that changes the configure script. I'm still studying how to support AVX2 on msvc build. Also, added more regression tests. The integration with lazy vacuum and parallel vacuum is missing for now. In order to support parallel vacuum, we need to have the radix tree support to be created on DSA. Added this item to the next CF. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've attached an updated version patch that changes the configure > script. I'm still studying how to support AVX2 on msvc build. Also, > added more regression tests. Thanks for the update, I will take a closer look at the patch in the near future, possibly next week. For now, though, I'd like to question why we even need to use 32-byte registers in the first place. For one, the paper referenced has 16-pointer nodes, but none for 32 (next level is 48 and uses a different method to find the index of the next pointer). Andres' prototype has 32-pointer nodes, but in a quick read of his patch a couple weeks ago I don't recall a reason mentioned for it. Even if 32-pointer nodes are better from a memory perspective, I imagine it should be possible to use two SSE2 registers to find the index. It'd be locally slightly more complex, but not much. It might not even cost much more in cycles since AVX2 would require indirecting through a function pointer. It's much more convenient if we don't need a runtime check. There are also thermal and power disadvantages when using AXV2 in some workloads. I'm not sure that's the case here, but if it is, we'd better be getting something in return. One more thing in general: In an earlier version, I noticed that Andres used the slab allocator and documented why. The last version of your patch that I saw had the same allocator, but not the "why". Especially in early stages of review, we want to document design decisions so it's more clear for the reader. -- John Naylor EDB: http://www.enterprisedb.com
On 2022-06-16 Th 00:56, Masahiko Sawada wrote: > > I've attached an updated version patch that changes the configure > script. I'm still studying how to support AVX2 on msvc build. Also, > added more regression tests. I think you would need to add '/arch:AVX2' to the compiler flags in MSBuildProject.pm. See <https://docs.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170> cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On Thu, Jun 16, 2022 at 4:30 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached an updated version patch that changes the configure > > script. I'm still studying how to support AVX2 on msvc build. Also, > > added more regression tests. > > Thanks for the update, I will take a closer look at the patch in the > near future, possibly next week. Thanks! > For now, though, I'd like to question > why we even need to use 32-byte registers in the first place. For one, > the paper referenced has 16-pointer nodes, but none for 32 (next level > is 48 and uses a different method to find the index of the next > pointer). Andres' prototype has 32-pointer nodes, but in a quick read > of his patch a couple weeks ago I don't recall a reason mentioned for > it. I might be wrong but since AVX2 instruction set is introduced in Haswell microarchitecture in 2013 and the referenced paper is published in the same year, the art didn't use AVX2 instruction set. 32-pointer nodes are better from a memory perspective as you mentioned. Andres' prototype supports both 16-pointer nodes and 32-pointer nodes (out of 6 node types). This would provide better memory usage but on the other hand, it would also bring overhead of switching the node type. Anyway, it's an important design decision to support which size of node to support. It should be done based on experiment results and documented. > Even if 32-pointer nodes are better from a memory perspective, I > imagine it should be possible to use two SSE2 registers to find the > index. It'd be locally slightly more complex, but not much. It might > not even cost much more in cycles since AVX2 would require indirecting > through a function pointer. It's much more convenient if we don't need > a runtime check. Right. > There are also thermal and power disadvantages when > using AXV2 in some workloads. I'm not sure that's the case here, but > if it is, we'd better be getting something in return. Good point. > One more thing in general: In an earlier version, I noticed that > Andres used the slab allocator and documented why. The last version of > your patch that I saw had the same allocator, but not the "why". > Especially in early stages of review, we want to document design > decisions so it's more clear for the reader. Indeed. I'll add comments in the next version patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: [v3 patch] Hi Masahiko, Since there are new files, and they are pretty large, I've attached most specific review comments and questions as a diff rather than in the email body. This is not a full review, which will take more time -- this is a first pass mostly to aid my understanding, and discuss some of the design and performance implications. I tend to think it's a good idea to avoid most cosmetic review until it's close to commit, but I did mention a couple things that might enhance readability during review. As I mentioned to you off-list, I have some thoughts on the nodes using SIMD: > On Thu, Jun 16, 2022 at 4:30 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > For now, though, I'd like to question > > why we even need to use 32-byte registers in the first place. For one, > > the paper referenced has 16-pointer nodes, but none for 32 (next level > > is 48 and uses a different method to find the index of the next > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read > > of his patch a couple weeks ago I don't recall a reason mentioned for > > it. > > I might be wrong but since AVX2 instruction set is introduced in > Haswell microarchitecture in 2013 and the referenced paper is > published in the same year, the art didn't use AVX2 instruction set. Sure, but with a bit of work the same technique could be done on that node size with two 16-byte registers. > 32-pointer nodes are better from a memory perspective as you > mentioned. Andres' prototype supports both 16-pointer nodes and > 32-pointer nodes (out of 6 node types). This would provide better > memory usage but on the other hand, it would also bring overhead of > switching the node type. Right, using more node types provides smaller increments of node size. Just changing node type can be better or worse, depending on the input. > Anyway, it's an important design decision to > support which size of node to support. It should be done based on > experiment results and documented. Agreed. I would add that in the first step, we want something straightforward to read and easy to integrate into our codebase. I suspect other optimizations would be worth a lot more than using AVX2: - collapsing inner nodes - taking care when constructing the key (more on this when we integrate with VACUUM) ...and a couple Andres mentioned: - memory management: in https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de - node dispatch: https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de Therefore, I would suggest that we use SSE2 only, because: - portability is very easy - to avoid a performance hit from indirecting through a function pointer When the PG16 cycle opens, I will work separately on ensuring the portability of using SSE2, so you can focus on other aspects. I think it would be a good idea to have both node16 and node32 for testing. During benchmarking we can delete one or the other and play with the other thresholds a bit. Ideally, node16 and node32 would have the same code with a different loop count (1 or 2). More generally, there is too much duplication of code (noted by Andres in his PoC), and there are many variable names with the node size embedded. 
This is a bit tricky to make more general, so we don't need to try it
yet, but ideally we would have something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
	case RADIX_TREE_NODE_KIND_4:
		idx = node_search_eq(node, chunk, 4);
		do_action(node, idx, 4, ...);
		break;
	case RADIX_TREE_NODE_KIND_32:
		idx = node_search_eq(node, chunk, 32);
		do_action(node, idx, 32, ...);
	...
}

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
	if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
		// do simple loop with (node_simple *) node;
	else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
		// do vectorized loop where available with (node_vec *) node;
	...
}

...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.

Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
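To make the SSE2 alternative discussed above concrete, here is a
minimal sketch of an equality search over a 16-entry chunk array; the
layout and names are illustrative, and a node-32 variant would simply
repeat this over two 16-byte halves:

#include <emmintrin.h>			/* SSE2 intrinsics */

/* chunks[] must be 16 bytes long even if only 'count' entries are valid */
static inline int
node_16_search_eq(const uint8 *chunks, uint8 count, uint8 chunk)
{
	__m128i		spread = _mm_set1_epi8((char) chunk);
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	__m128i		cmp = _mm_cmpeq_epi8(spread, haystack);
	uint32		mask = _mm_movemask_epi8(cmp) & ((1u << count) - 1);

	return mask ? pg_rightmost_one_pos32(mask) : -1;	/* slot index or -1 */
}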
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to > "simple loop", and looping over up to 32 elements is not great > (although possibly okay). We could do binary search, but that has bad > branch prediction. I am not sure that for relevant non-x86 platforms SIMD / vector instructions would not be used (though it would be a good idea to verify) Do you know any modern platforms that do not have SIMD ? I would definitely test before assuming binary search is better. Often other approaches like counting search over such small vectors is much better when the vector fits in cache (or even a cache line) and you always visit all items as this will completely avoid branch predictions and allows compiler to vectorize and / or unroll the loop as needed. Cheers Hannu
Hi, On 2022-06-27 18:12:13 +0700, John Naylor wrote: > Another thought: for non-x86 platforms, the SIMD nodes degenerate to > "simple loop", and looping over up to 32 elements is not great > (although possibly okay). We could do binary search, but that has bad > branch prediction. I'd be quite quite surprised if binary search were cheaper. Particularly on less fancy platforms. - Andres
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote: > > > Another thought: for non-x86 platforms, the SIMD nodes degenerate to > > "simple loop", and looping over up to 32 elements is not great > > (although possibly okay). We could do binary search, but that has bad > > branch prediction. > > I am not sure that for relevant non-x86 platforms SIMD / vector > instructions would not be used (though it would be a good idea to > verify) By that logic, we can also dispense with intrinsics on x86 because the compiler will autovectorize there too (if I understand your claim correctly). I'm not quite convinced of that in this case. > I would definitely test before assuming binary search is better. I wasn't very clear in my language, but I did reject binary search as having bad branch prediction. -- John Naylor EDB: http://www.enterprisedb.com
Hi, On 2022-06-28 11:17:42 +0700, John Naylor wrote: > On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote: > > > > > Another thought: for non-x86 platforms, the SIMD nodes degenerate to > > > "simple loop", and looping over up to 32 elements is not great > > > (although possibly okay). We could do binary search, but that has bad > > > branch prediction. > > > > I am not sure that for relevant non-x86 platforms SIMD / vector > > instructions would not be used (though it would be a good idea to > > verify) > > By that logic, we can also dispense with intrinsics on x86 because the > compiler will autovectorize there too (if I understand your claim > correctly). I'm not quite convinced of that in this case. Last time I checked (maybe a year ago?) none of the popular compilers could autovectorize that code pattern. Greetings, Andres Freund
Hi, On Mon, Jun 27, 2022 at 8:12 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > [v3 patch] > > Hi Masahiko, > > Since there are new files, and they are pretty large, I've attached > most specific review comments and questions as a diff rather than in > the email body. This is not a full review, which will take more time > -- this is a first pass mostly to aid my understanding, and discuss > some of the design and performance implications. > > I tend to think it's a good idea to avoid most cosmetic review until > it's close to commit, but I did mention a couple things that might > enhance readability during review. Thank you for reviewing the patch! > > As I mentioned to you off-list, I have some thoughts on the nodes using SIMD: > > > On Thu, Jun 16, 2022 at 4:30 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > For now, though, I'd like to question > > > why we even need to use 32-byte registers in the first place. For one, > > > the paper referenced has 16-pointer nodes, but none for 32 (next level > > > is 48 and uses a different method to find the index of the next > > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read > > > of his patch a couple weeks ago I don't recall a reason mentioned for > > > it. > > > > I might be wrong but since AVX2 instruction set is introduced in > > Haswell microarchitecture in 2013 and the referenced paper is > > published in the same year, the art didn't use AVX2 instruction set. > > Sure, but with a bit of work the same technique could be done on that > node size with two 16-byte registers. > > > 32-pointer nodes are better from a memory perspective as you > > mentioned. Andres' prototype supports both 16-pointer nodes and > > 32-pointer nodes (out of 6 node types). This would provide better > > memory usage but on the other hand, it would also bring overhead of > > switching the node type. > > Right, using more node types provides smaller increments of node size. > Just changing node type can be better or worse, depending on the > input. > > > Anyway, it's an important design decision to > > support which size of node to support. It should be done based on > > experiment results and documented. > > Agreed. I would add that in the first step, we want something > straightforward to read and easy to integrate into our codebase. Agreed. > I > suspect other optimizations would be worth a lot more than using AVX2: > - collapsing inner nodes > - taking care when constructing the key (more on this when we > integrate with VACUUM) > ...and a couple Andres mentioned: > - memory management: in > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de > - node dispatch: > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de > > Therefore, I would suggest that we use SSE2 only, because: > - portability is very easy > - to avoid a performance hit from indirecting through a function pointer Okay, I'll try these optimizations and see if the performance becomes better. > > When the PG16 cycle opens, I will work separately on ensuring the > portability of using SSE2, so you can focus on other aspects. Thanks! > I think it would be a good idea to have both node16 and node32 for testing. > During benchmarking we can delete one or the other and play with the > other thresholds a bit. I've done benchmark tests while changing the node types. 
The code base is the v3 patch, which doesn't have the optimizations you mentioned below (memory management and node dispatch), but I added code to use SSE2 for node-16 and node-32. The 'name' in the results below indicates the instruction set used (AVX2 or SSE2) and the node types used. For instance, sse2_4_32_48_256 means the radix tree has four node types with 4, 32, 48, and 256 pointers, respectively, and uses the SSE2 instruction set.

* Case1 - Dense (simulating the case where there are 1000 consecutive pages each of which has 100 dead tuples, at 100 page intervals.)

select prepare(
1000000, -- max block
100,     -- # of dead tuples per page
1,       -- dead tuples interval within a page
1000,    -- # of consecutive pages having dead tuples
1100     -- page interval
);

name                    size        attach        lookup
avx2_4_32_128_256       1154 MB     6742.53 ms    47765.63 ms
avx2_4_32_48_256        1839 MB     4239.35 ms    40528.39 ms
sse2_4_16_128_256       1154 MB     6994.43 ms    40383.85 ms
sse2_4_16_32_128_256    1154 MB     7239.35 ms    43542.39 ms
sse2_4_16_48_256        1839 MB     4404.63 ms    36048.96 ms
sse2_4_32_128_256       1154 MB     6688.50 ms    44902.64 ms

* Case2 - Sparse (simulating a case where there are pages that have 2 dead tuples every 1000 pages.)

select prepare(
10000000, -- max block
2,        -- # of dead tuples per page
50,       -- dead tuples interval within a page
1,        -- # of consecutive pages having dead tuples
1000      -- page interval
);

name                    size       attach     lookup
avx2_4_32_128_256       1535 kB    1.85 ms    17427.42 ms
avx2_4_32_48_256        1472 kB    2.01 ms    22176.75 ms
sse2_4_16_128_256       1582 kB    2.16 ms    15391.12 ms
sse2_4_16_32_128_256    1535 kB    2.14 ms    18757.86 ms
sse2_4_16_48_256        1489 kB    1.91 ms    19210.39 ms
sse2_4_32_128_256       1535 kB    2.05 ms    17777.55 ms

The statistics of the number of each node type are:

* avx2_4_32_128_256 (dense and sparse)
  * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
  * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

* avx2_4_32_48_256
  * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n48 = 227, n256 = 916433
  * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n48 = 159, n256 = 50

* sse2_4_16_128_256
  * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n128 = 916914, n256 = 31
  * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n128 = 256, n256 = 1

* sse2_4_16_32_128_256
  * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n32 = 285, n128 = 916629, n256 = 31
  * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n32 = 48, n128 = 208, n256 = 1

* sse2_4_16_48_256
  * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
  * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50

* sse2_4_32_128_256
  * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
  * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

Observations are:

In both test cases, there is not much difference between using AVX2 and SSE2. The more node types, the more time it takes to load the data (see sse2_4_16_32_128_256).

In the dense case, since most nodes have around 100 children, the radix tree that has node-128 had a good figure in terms of memory usage. On the other hand, the radix tree that doesn't have node-128 does better in terms of insertion performance. This is probably because we need to iterate over the 'isset' flags from the beginning of the array in order to find an empty slot when inserting new data. We do the same thing for node-48, but it was better than node-128 since it only has up to 48 entries.
In terms of lookup performance, the results vary but I could not find any common pattern that makes the performance better or worse. Getting more statistics such as the number of each node type per tree level might help me. > Ideally, node16 and node32 would have the same code with a different > loop count (1 or 2). More generally, there is too much duplication of > code (noted by Andres in his PoC), and there are many variable names > with the node size embedded. This is a bit tricky to make more > general, so we don't need to try it yet, but ideally we would have > something similar to: > > switch (node->kind) // todo: inspect tagged pointer > { > case RADIX_TREE_NODE_KIND_4: > idx = node_search_eq(node, chunk, 4); > do_action(node, idx, 4, ...); > break; > case RADIX_TREE_NODE_KIND_32: > idx = node_search_eq(node, chunk, 32); > do_action(node, idx, 32, ...); > ... > } > > static pg_alwaysinline void > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) > { > if (node_fanout <= SIMPLE_LOOP_THRESHOLD) > // do simple loop with (node_simple *) node; > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) > // do vectorized loop where available with (node_vec *) node; > ... > } > > ...and let the compiler do loop unrolling and branch removal. Not sure > how difficult this is to do, but something to think about. Agreed. I'll update my patch based on your review comments and use SSE2. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
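For reference, a rough sketch of what an SSE2 equality search over a 16-entry node's chunk array could look like, along the lines of the ART paper's node16 lookup (names and layout here are illustrative assumptions, not the patch's actual code; __builtin_ctz is the GCC/Clang builtin):

#include <stdint.h>
#include <emmintrin.h>      /* SSE2 intrinsics */

/*
 * Find 'chunk' among the first 'count' (<= 16) bytes of 'chunks', which is
 * assumed to be a 16-byte array. All 16 lanes are compared at once and the
 * unused lanes are masked off. Returns the matching index, or -1.
 */
static inline int
node16_search_eq(const uint8_t *chunks, int count, uint8_t chunk)
{
    __m128i     spread_chunk = _mm_set1_epi8((char) chunk);
    __m128i     haystack = _mm_loadu_si128((const __m128i *) chunks);
    __m128i     cmp = _mm_cmpeq_epi8(spread_chunk, haystack);
    uint32_t    bitfield = (uint32_t) _mm_movemask_epi8(cmp) & ((1u << count) - 1);

    return bitfield ? __builtin_ctz(bitfield) : -1;
}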
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I > > suspect other optimizations would be worth a lot more than using AVX2: > > - collapsing inner nodes > > - taking care when constructing the key (more on this when we > > integrate with VACUUM) > > ...and a couple Andres mentioned: > > - memory management: in > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de > > - node dispatch: > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de > > > > Therefore, I would suggest that we use SSE2 only, because: > > - portability is very easy > > - to avoid a performance hit from indirecting through a function pointer > > Okay, I'll try these optimizations and see if the performance becomes better. FWIW, I think it's fine if we delay these until after committing a good-enough version. The exception is key construction and I think that deserves some attention now (more on this below). > I've done benchmark tests while changing the node types. The code base > is v3 patch that doesn't have the optimization you mentioned below > (memory management and node dispatch) but I added the code to use SSE2 > for node-16 and node-32. Great, this is helpful to visualize what's going on! > * sse2_4_16_48_256 > * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433 > * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50 > > * sse2_4_32_128_256 > * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31 > * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1 > Observations are: > > In both test cases, There is not much difference between using AVX2 > and SSE2. The more mode types, the more time it takes for loading the > data (see sse2_4_16_32_128_256). Good to know. And as Andres mentioned in his PoC, more node types would be a barrier for pointer tagging, since 32-bit platforms only have two spare bits in the pointer. > In dense case, since most nodes have around 100 children, the radix > tree that has node-128 had a good figure in terms of memory usage. On Looking at the node stats, and then your benchmark code, I think key construction is a major influence, maybe more than node type. The key/value scheme tested now makes sense: blockhi || blocklo || 9 bits of item offset (with the leaf nodes containing a bit map of the lowest few bits of this whole thing) We want the lower fanout nodes at the top of the tree and higher fanout ones at the bottom. Note some consequences: If the table has enough columns such that much fewer than 100 tuples fit on a page (maybe 30 or 40), then in the dense case the nodes above the leaves will have lower fanout (maybe they will fit in a node32). Also, the bitmap values in the leaves will be more empty. In other words, many tables in the wild *resemble* the sparse case a bit, even if truly all tuples on the page are dead. Note also that the dense case in the benchmark above has ~4500 times more keys than the sparse case, and uses about ~1000 times more memory. But the runtime is only 2-3 times longer. That's interesting to me. To optimize for the sparse case, it seems to me that the key/value would be blockhi || 9 bits of item offset || blocklo I believe that would make the leaf nodes more dense, with fewer inner nodes, and could drastically speed up the sparse case, and maybe many realistic dense cases. I'm curious to hear your thoughts. 
> the other hand, the radix tree that doesn't have node-128 has a better > number in terms of insertion performance. This is probably because we > need to iterate over 'isset' flags from the beginning of the array in > order to find an empty slot when inserting new data. We do the same > thing also for node-48 but it was better than node-128 as it's up to > 48. I mentioned in my diff, but for those following along, I think we can improve that by iterating over the bytes and if it's 0xFF all 8 bits are set already so keep looking... > In terms of lookup performance, the results vary but I could not find > any common pattern that makes the performance better or worse. Getting > more statistics such as the number of each node type per tree level > might help me. I think that's a sign that the choice of node types might not be terribly important for these two cases. That's good if that's true in general -- a future performance-critical use of this code might tweak things for itself without upsetting vacuum. -- John Naylor EDB: http://www.enterprisedb.com
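Spelling that out as a small sketch (assuming a node-128 whose 'isset' bitmap is a plain 16-byte array; names are illustrative, not the patch's actual code):

#include <stdint.h>

/*
 * Find a free value slot in a node-128 by scanning the 'isset' bitmap one
 * byte at a time. A byte equal to 0xFF means all eight of its slots are
 * already taken, so its individual bits need not be inspected.
 * Returns a free slot index, or -1 if the node is full.
 */
static inline int
node128_find_free_slot(const uint8_t isset[16])
{
    for (int byte = 0; byte < 16; byte++)
    {
        if (isset[byte] == 0xFF)
            continue;           /* all 8 slots in this byte are used */

        for (int bit = 0; bit < 8; bit++)
        {
            if ((isset[byte] & (1 << bit)) == 0)
                return byte * 8 + bit;
        }
    }

    return -1;                  /* node is full */
}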
On Tue, Jun 28, 2022 at 10:10 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I > > > suspect other optimizations would be worth a lot more than using AVX2: > > > - collapsing inner nodes > > > - taking care when constructing the key (more on this when we > > > integrate with VACUUM) > > > ...and a couple Andres mentioned: > > > - memory management: in > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de > > > - node dispatch: > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de > > > > > > Therefore, I would suggest that we use SSE2 only, because: > > > - portability is very easy > > > - to avoid a performance hit from indirecting through a function pointer > > > > Okay, I'll try these optimizations and see if the performance becomes better. > > FWIW, I think it's fine if we delay these until after committing a > good-enough version. The exception is key construction and I think > that deserves some attention now (more on this below). Agreed. > > > I've done benchmark tests while changing the node types. The code base > > is v3 patch that doesn't have the optimization you mentioned below > > (memory management and node dispatch) but I added the code to use SSE2 > > for node-16 and node-32. > > Great, this is helpful to visualize what's going on! > > > * sse2_4_16_48_256 > > * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433 > > * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50 > > > > * sse2_4_32_128_256 > > * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31 > > * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1 > > > Observations are: > > > > In both test cases, There is not much difference between using AVX2 > > and SSE2. The more mode types, the more time it takes for loading the > > data (see sse2_4_16_32_128_256). > > Good to know. And as Andres mentioned in his PoC, more node types > would be a barrier for pointer tagging, since 32-bit platforms only > have two spare bits in the pointer. > > > In dense case, since most nodes have around 100 children, the radix > > tree that has node-128 had a good figure in terms of memory usage. On > > Looking at the node stats, and then your benchmark code, I think key > construction is a major influence, maybe more than node type. The > key/value scheme tested now makes sense: > > blockhi || blocklo || 9 bits of item offset > > (with the leaf nodes containing a bit map of the lowest few bits of > this whole thing) > > We want the lower fanout nodes at the top of the tree and higher > fanout ones at the bottom. So more inner nodes can fit in CPU cache, right? > > Note some consequences: If the table has enough columns such that much > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the > dense case the nodes above the leaves will have lower fanout (maybe > they will fit in a node32). Also, the bitmap values in the leaves will > be more empty. In other words, many tables in the wild *resemble* the > sparse case a bit, even if truly all tuples on the page are dead. > > Note also that the dense case in the benchmark above has ~4500 times > more keys than the sparse case, and uses about ~1000 times more > memory. But the runtime is only 2-3 times longer. That's interesting > to me. 
> > To optimize for the sparse case, it seems to me that the key/value would be > > blockhi || 9 bits of item offset || blocklo > > I believe that would make the leaf nodes more dense, with fewer inner > nodes, and could drastically speed up the sparse case, and maybe many > realistic dense cases. Does it have an effect on the number of inner nodes? > I'm curious to hear your thoughts. Thank you for your analysis. It's worth trying. We use 9 bits for item offset but most pages don't use all bits in practice. So probably it might be better to move the most significant bit of item offset to the left of blockhi. Or more simply: 9 bits of item offset || blockhi || blocklo > > > the other hand, the radix tree that doesn't have node-128 has a better > > number in terms of insertion performance. This is probably because we > > need to iterate over 'isset' flags from the beginning of the array in > > order to find an empty slot when inserting new data. We do the same > > thing also for node-48 but it was better than node-128 as it's up to > > 48. > > I mentioned in my diff, but for those following along, I think we can > improve that by iterating over the bytes and if it's 0xFF all 8 bits > are set already so keep looking... Right. Using 0xFF also makes the code readable so I'll change that. > > > In terms of lookup performance, the results vary but I could not find > > any common pattern that makes the performance better or worse. Getting > > more statistics such as the number of each node type per tree level > > might help me. > > I think that's a sign that the choice of node types might not be > terribly important for these two cases. That's good if that's true in > general -- a future performance-critical use of this code might tweak > things for itself without upsetting vacuum. Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
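To make the layouts being compared concrete, here is a sketch of the three encodings (purely illustrative; the actual patches fold the lowest few offset bits into a leaf bitmap, which is omitted here, and the 9-bit offset width comes from MaxHeapTuplesPerPage with 8kB pages):

#include <stdint.h>

#define OFFSET_BITS 9

/* blockhi || blocklo || offset -- the scheme benchmarked so far */
static inline uint64_t
key_block_major(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) blkno << OFFSET_BITS) | offnum;
}

/* blockhi || offset || blocklo -- the suggestion upthread */
static inline uint64_t
key_offset_mid(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) (blkno >> 16) << (OFFSET_BITS + 16)) |
           ((uint64_t) offnum << 16) |
           (blkno & 0xFFFF);
}

/* offset || blockhi || blocklo -- the simpler variant above */
static inline uint64_t
key_offset_major(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) offnum << 32) | blkno;
}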
Hi, I just noticed that I had a reply sitting forgotten in my drafts... On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote: > To move this project forward, I've implemented a radix tree > from scratch while studying Andres's implementation. It > supports insertion, search, and iteration but not deletion yet. In my > implementation, I use Datum as the value so internal and leaf nodes > have the same data structure, simplifying the implementation. The > iteration on the radix tree returns keys with the value in ascending > order of the key. The patch has regression tests for the radix tree but is > still in PoC state: it has a lot of leftover debugging code, doesn't support SSE2 SIMD > instructions, and the added -mavx2 flag is hard-coded. Very cool - thanks for picking this up. Greetings, Andres Freund
Hi, On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c > new file mode 100644 > index 0000000000..bf87f932fd > --- /dev/null > +++ b/src/backend/lib/radixtree.c > @@ -0,0 +1,1763 @@ > +/*------------------------------------------------------------------------- > + * > + * radixtree.c > + * Implementation for adaptive radix tree. > + * > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas > + * Neumann, 2013. > + * > + * There are some differences from the proposed implementation. For instance, > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper. > + * Also, there is no support for path compression and lazy path expansion. The > + * radix tree supports fixed length of the key so we don't expect the tree level > + * wouldn't be high. I think we're going to need path compression at some point, fwiw. I'd bet on it being beneficial even for the tid case. > + * The key is a 64-bit unsigned integer and the value is a Datum. I don't think it's a good idea to define the value type to be a datum. > +/* > + * As we descend a radix tree, we push the node to the stack. The stack is used > + * at deletion. > + */ > +typedef struct radix_tree_stack_data > +{ > + radix_tree_node *node; > + struct radix_tree_stack_data *parent; > +} radix_tree_stack_data; > +typedef radix_tree_stack_data *radix_tree_stack; I think it's a very bad idea for traversal to need allocations. I really want to eventually use this for shared structures (eventually with lock-free searches at least), and needing to do allocations while traversing the tree is a no-go for that. Particularly given that the tree currently has a fixed depth, can't you just allocate this on the stack once? > +/* > + * Allocate a new node with the given node kind. > + */ > +static radix_tree_node * > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind) > +{ > + radix_tree_node *newnode; > + > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind], > + radix_tree_node_info[kind].size); > + newnode->kind = kind; > + > + /* update the statistics */ > + tree->mem_used += GetMemoryChunkSpace(newnode); > + tree->cnt[kind]++; > + > + return newnode; > +} Why are you tracking the memory usage at this level of detail? It's *much* cheaper to track memory usage via the memory contexts? Since they're dedicated for the radix tree, that ought to be sufficient? > + else if (idx != n4->n.count) > + { > + /* > + * the key needs to be inserted in the middle of the > + * array, make space for the new key. > + */ > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]), > + sizeof(uint8) * (n4->n.count - idx)); > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]), > + sizeof(radix_tree_node *) * (n4->n.count - idx)); > + } Maybe we could add a static inline helper for these memmoves? Both because it's repetitive (for different node types) and because the last time I looked gcc was generating quite bad code for this. And having to put workarounds into multiple places is obviously worse than having to do it in one place. > +/* > + * Insert the key with the val. > + * > + * found_p is set to true if the key already present, otherwise false, if > + * it's not NULL. > + * > + * XXX: do we need to support update_if_exists behavior? 
> + */ Yes, I think that's needed - hence using bfm_set() instead of insert() in the prototype. > +void > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > +{ > + int shift; > + bool replaced; > + radix_tree_node *node; > + radix_tree_node *parent = tree->root; > + > + /* Empty tree, create the root */ > + if (!tree->root) > + radix_tree_new_root(tree, key, val); > + > + /* Extend the tree if necessary */ > + if (key > tree->max_val) > + radix_tree_extend(tree, key); FWIW, the reason I used separate functions for these in the prototype is that it turns out to generate a lot better code, because it allows non-inlined function calls to be sibling calls - thereby avoiding the need for a dedicated stack frame. That's not possible once you need a palloc or such, so splitting off those call paths into dedicated functions is useful. Greetings, Andres Freund
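As a rough sketch of the kind of static inline helper suggested above for those memmoves (types and names are illustrative, not from the patch):

#include <stdint.h>
#include <string.h>

/*
 * Open a gap at position 'idx' in a node's parallel chunk and slot arrays
 * by shifting the 'count - idx' existing entries one place to the right,
 * so a new entry can be inserted at 'idx'.
 */
static inline void
node_make_gap(uint8_t *chunks, void **slots, int count, int idx)
{
    memmove(&chunks[idx + 1], &chunks[idx],
            sizeof(uint8_t) * (count - idx));
    memmove(&slots[idx + 1], &slots[idx],
            sizeof(void *) * (count - idx));
}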
Hi, On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote: > In both test cases, there is not much difference between using AVX2 > and SSE2. The more node types, the more time it takes for loading the > data (see sse2_4_16_32_128_256). Yea, at some point the compiler starts using a jump table instead of branches, and that turns out to be a good bit more expensive. And even with branches, it obviously adds hard-to-predict branches. IIRC I fought a bit with the compiler to avoid some of that cost, it's possible that got "lost" in Sawada-san's patch. Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node one is not unimportant until we have path compression.

Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other bytes, but needing that separate isset array somehow is sad :/. I wonder if a smaller "free index" would do the trick? Point to the element + 1 where we searched last and start a plain loop there. Particularly in an insert-only workload that'll always work, and in other cases it'll still often work I think. One thing I was wondering about is trying to choose node types in roughly-power-of-two struct sizes. It's pretty easy to end up with significant fragmentation in the slabs right now when inserting as you go, because some of the smaller node types will be freed but not enough to actually free blocks of memory. If we instead have ~power-of-two sizes we could just use a single slab of the max size, and carve the smaller node types out of that largest allocation. Btw, that fragmentation is another reason why I think it's better to track memory usage via memory contexts, rather than doing so based on GetMemoryChunkSpace(). > > Ideally, node16 and node32 would have the same code with a different > > loop count (1 or 2). More generally, there is too much duplication of > > code (noted by Andres in his PoC), and there are many variable names > > with the node size embedded. This is a bit tricky to make more > > general, so we don't need to try it yet, but ideally we would have > > something similar to: > > > > switch (node->kind) // todo: inspect tagged pointer > > { > > case RADIX_TREE_NODE_KIND_4: > > idx = node_search_eq(node, chunk, 4); > > do_action(node, idx, 4, ...); > > break; > > case RADIX_TREE_NODE_KIND_32: > > idx = node_search_eq(node, chunk, 32); > > do_action(node, idx, 32, ...); > > ... > > } FWIW, that should be doable with an inline function, if you pass it the memory to the "array" rather than the node directly. Not so sure it's a good idea to do dispatch between node types / search methods inside the helper, as you suggest below: > > static pg_alwaysinline void > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) > > { > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD) > > // do simple loop with (node_simple *) node; > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) > > // do vectorized loop where available with (node_vec *) node; > > ... > > } Greetings, Andres Freund
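A rough sketch of the "free index" idea (the struct layout and names are made up for illustration; this is not the patch's code):

#include <stdint.h>

typedef struct rt_node_128
{
    uint8_t     isset[16];      /* one bit per value slot, 128 slots */
    uint8_t     free_hint;      /* slot index to try first */
    /* ... chunk-to-slot mapping and the 128 values would follow ... */
} rt_node_128;

/*
 * Reserve a free value slot, starting the search where the previous search
 * left off. In an insert-only workload this succeeds on the first probe;
 * after deletions it may wrap around and scan the whole bitmap.
 */
static inline int
rt_node_128_reserve_slot(rt_node_128 *node)
{
    for (int i = 0; i < 128; i++)
    {
        int         slot = (node->free_hint + i) % 128;

        if ((node->isset[slot / 8] & (1 << (slot % 8))) == 0)
        {
            node->isset[slot / 8] |= 1 << (slot % 8);
            node->free_hint = (uint8_t) ((slot + 1) % 128);
            return slot;
        }
    }

    return -1;                  /* node is full */
}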
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jun 28, 2022 at 10:10 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > I > > > > suspect other optimizations would be worth a lot more than using AVX2: > > > > - collapsing inner nodes > > > > - taking care when constructing the key (more on this when we > > > > integrate with VACUUM) > > > > ...and a couple Andres mentioned: > > > > - memory management: in > > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de > > > > - node dispatch: > > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de > > > > > > > > Therefore, I would suggest that we use SSE2 only, because: > > > > - portability is very easy > > > > - to avoid a performance hit from indirecting through a function pointer > > > > > > Okay, I'll try these optimizations and see if the performance becomes better. > > > > FWIW, I think it's fine if we delay these until after committing a > > good-enough version. The exception is key construction and I think > > that deserves some attention now (more on this below). > > Agreed. > > > > > > I've done benchmark tests while changing the node types. The code base > > > is v3 patch that doesn't have the optimization you mentioned below > > > (memory management and node dispatch) but I added the code to use SSE2 > > > for node-16 and node-32. > > > > Great, this is helpful to visualize what's going on! > > > > > * sse2_4_16_48_256 > > > * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433 > > > * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50 > > > > > > * sse2_4_32_128_256 > > > * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31 > > > * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1 > > > > > Observations are: > > > > > > In both test cases, There is not much difference between using AVX2 > > > and SSE2. The more mode types, the more time it takes for loading the > > > data (see sse2_4_16_32_128_256). > > > > Good to know. And as Andres mentioned in his PoC, more node types > > would be a barrier for pointer tagging, since 32-bit platforms only > > have two spare bits in the pointer. > > > > > In dense case, since most nodes have around 100 children, the radix > > > tree that has node-128 had a good figure in terms of memory usage. On > > > > Looking at the node stats, and then your benchmark code, I think key > > construction is a major influence, maybe more than node type. The > > key/value scheme tested now makes sense: > > > > blockhi || blocklo || 9 bits of item offset > > > > (with the leaf nodes containing a bit map of the lowest few bits of > > this whole thing) > > > > We want the lower fanout nodes at the top of the tree and higher > > fanout ones at the bottom. > > So more inner nodes can fit in CPU cache, right? > > > > > Note some consequences: If the table has enough columns such that much > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the > > dense case the nodes above the leaves will have lower fanout (maybe > > they will fit in a node32). Also, the bitmap values in the leaves will > > be more empty. In other words, many tables in the wild *resemble* the > > sparse case a bit, even if truly all tuples on the page are dead. 
> > > > Note also that the dense case in the benchmark above has ~4500 times > > more keys than the sparse case, and uses about ~1000 times more > > memory. But the runtime is only 2-3 times longer. That's interesting > > to me. > > > > To optimize for the sparse case, it seems to me that the key/value would be > > > > blockhi || 9 bits of item offset || blocklo > > > > I believe that would make the leaf nodes more dense, with fewer inner > > nodes, and could drastically speed up the sparse case, and maybe many > > realistic dense cases. > > Does it have an effect on the number of inner nodes? > > > I'm curious to hear your thoughts. > > Thank you for your analysis. It's worth trying. We use 9 bits for item > offset but most pages don't use all bits in practice. So probably it > might be better to move the most significant bit of item offset to the > left of blockhi. Or more simply: > > 9 bits of item offset || blockhi || blocklo > > > > > > the other hand, the radix tree that doesn't have node-128 has a better > > > number in terms of insertion performance. This is probably because we > > > need to iterate over 'isset' flags from the beginning of the array in > > > order to find an empty slot when inserting new data. We do the same > > > thing also for node-48 but it was better than node-128 as it's up to > > > 48. > > > > I mentioned in my diff, but for those following along, I think we can > > improve that by iterating over the bytes and if it's 0xFF all 8 bits > > are set already so keep looking... > > Right. Using 0xFF also makes the code readable so I'll change that. > > > > > > In terms of lookup performance, the results vary but I could not find > > > any common pattern that makes the performance better or worse. Getting > > > more statistics such as the number of each node type per tree level > > > might help me. > > > > I think that's a sign that the choice of node types might not be > > terribly important for these two cases. That's good if that's true in > > general -- a future performance-critical use of this code might tweak > > things for itself without upsetting vacuum. > > Agreed. > I've attached an updated patch that incorporated comments from John. Here are some comments I could not address and the reason: +// bitfield is uint32, so we don't need UINT64_C bitfield &= ((UINT64_C(1) << node->n.count) - 1); Since node->n.count could be 32, I think we need to use UINT64CONST() here. /* Macros for radix tree nodes */ +// not sure why are we doing casts here? #define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0) #define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0) I've left the casts as I use IS_LEAF_NODE for rt_node_4/16/32/128/256. Also, I've dropped the configure script support for AVX2, and support for SSE2 is missing. I'll update it later. I've not addressed the comments I got from Andres yet so I'll update the patch according to the discussion but the current patch would be more readable than the previous one thanks to the comments from John. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
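For anyone following along, a small illustration of the UINT64CONST() point (illustrative only): when the count can reach 32, the 1 being shifted has to be a 64-bit constant, since shifting a 32-bit 1 left by 32 is undefined behavior.

#include <stdint.h>

/* Mask selecting the low 'count' bits, valid for 0 <= count <= 32. */
static inline uint64_t
low_bits_mask(int count)
{
    return (UINT64_C(1) << count) - 1;  /* a 32-bit "1u << count" breaks at count == 32 */
}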
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote: > > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c > > new file mode 100644 > > index 0000000000..bf87f932fd > > --- /dev/null > > +++ b/src/backend/lib/radixtree.c > > @@ -0,0 +1,1763 @@ > > +/*------------------------------------------------------------------------- > > + * > > + * radixtree.c > > + * Implementation for adaptive radix tree. > > + * > > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful > > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas > > + * Neumann, 2013. > > + * > > + * There are some differences from the proposed implementation. For instance, > > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit > > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper. > > + * Also, there is no support for path compression and lazy path expansion. The > > + * radix tree supports fixed length of the key so we don't expect the tree level > > + * wouldn't be high. > > I think we're going to need path compression at some point, fwiw. I'd bet on > it being beneficial even for the tid case. > > > > + * The key is a 64-bit unsigned integer and the value is a Datum. > > I don't think it's a good idea to define the value type to be a datum. A datum value is convenient to represent both a pointer and a value so I used it to avoid defining node types for inner and leaf nodes separately. Since a datum could be 4 bytes or 8 bytes depending it might not be good for some platforms. But what kind of aspects do you not like the idea of using datum? > > > > +/* > > + * As we descend a radix tree, we push the node to the stack. The stack is used > > + * at deletion. > > + */ > > +typedef struct radix_tree_stack_data > > +{ > > + radix_tree_node *node; > > + struct radix_tree_stack_data *parent; > > +} radix_tree_stack_data; > > +typedef radix_tree_stack_data *radix_tree_stack; > > I think it's a very bad idea for traversal to need allocations. I really want > to eventually use this for shared structures (eventually with lock-free > searches at least), and needing to do allocations while traversing the tree is > a no-go for that. > > Particularly given that the tree currently has a fixed depth, can't you just > allocate this on the stack once? Yes, we can do that. > > > +/* > > + * Allocate a new node with the given node kind. > > + */ > > +static radix_tree_node * > > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind) > > +{ > > + radix_tree_node *newnode; > > + > > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind], > > + radix_tree_node_info[kind].size); > > + newnode->kind = kind; > > + > > + /* update the statistics */ > > + tree->mem_used += GetMemoryChunkSpace(newnode); > > + tree->cnt[kind]++; > > + > > + return newnode; > > +} > > Why are you tracking the memory usage at this level of detail? It's *much* > cheaper to track memory usage via the memory contexts? Since they're dedicated > for the radix tree, that ought to be sufficient? Indeed. I'll use MemoryContextMemAllocated instead. > > > > + else if (idx != n4->n.count) > > + { > > + /* > > + * the key needs to be inserted in the middle of the > > + * array, make space for the new key. 
> > + */ > > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]), > > + sizeof(uint8) * (n4->n.count - idx)); > > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]), > > + sizeof(radix_tree_node *) * (n4->n.count - idx)); > > + } > > Maybe we could add a static inline helper for these memmoves? Both because > it's repetitive (for different node types) and because the last time I looked > gcc was generating quite bad code for this. And having to put workarounds into > multiple places is obviously worse than having to do it in one place. Agreed, I'll update it. > > > > +/* > > + * Insert the key with the val. > > + * > > + * found_p is set to true if the key already present, otherwise false, if > > + * it's not NULL. > > + * > > + * XXX: do we need to support update_if_exists behavior? > > + */ > > Yes, I think that's needed - hence using bfm_set() instead of insert() in the > prototype. Agreed. > > > > +void > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > > +{ > > + int shift; > > + bool replaced; > > + radix_tree_node *node; > > + radix_tree_node *parent = tree->root; > > + > > + /* Empty tree, create the root */ > > + if (!tree->root) > > + radix_tree_new_root(tree, key, val); > > + > > + /* Extend the tree if necessary */ > > + if (key > tree->max_val) > > + radix_tree_extend(tree, key); > > FWIW, the reason I used separate functions for these in the prototype is that > it turns out to generate a lot better code, because it allows non-inlined > function calls to be sibling calls - thereby avoiding the need for a dedicated > stack frame. That's not possible once you need a palloc or such, so splitting > off those call paths into dedicated functions is useful. Thank you for the info. How much does using sibling call optimization help the performance in this case? I think that these two cases are used only a limited number of times: inserting the first key and extending the tree. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote: > > In both test cases, There is not much difference between using AVX2 > > and SSE2. The more mode types, the more time it takes for loading the > > data (see sse2_4_16_32_128_256). > > Yea, at some point the compiler starts using a jump table instead of branches, > and that turns out to be a good bit more expensive. And even with branches, it > obviously adds hard to predict branches. IIRC I fought a bit with the compiler > to avoid some of that cost, it's possible that got "lost" in Sawada-san's > patch. > > > Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node > one is not unimportant until we have path compression. I wanted to start with a smaller number of node types for simplicity. 16 node type has been added to v4 patch I submitted[1]. I think it's trade-off between better memory and the overhead of growing (and shrinking) the node type. I'm going to add more node types once we turn out based on the benchmark that it's beneficial. > > Right now the node struct sizes are: > 4 - 48 bytes > 32 - 296 bytes > 128 - 1304 bytes > 256 - 2088 bytes > > I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other > bytes, but needing that separate isset array somehow is sad :/. I wonder if a > smaller "free index" would do the trick? Point to the element + 1 where we > searched last and start a plain loop there. Particularly in an insert-only > workload that'll always work, and in other cases it'll still often work I > think. radix_tree_node_128->isset is used to distinguish between null-pointer in inner nodes and 0 in leaf nodes. So I guess we can have a flag to indicate a leaf or an inner so that we can interpret (Datum) 0 as either null-pointer or 0. Or if we define different data types for inner and leaf nodes probably we don't need it. > One thing I was wondering about is trying to choose node types in > roughly-power-of-two struct sizes. It's pretty easy to end up with significant > fragmentation in the slabs right now when inserting as you go, because some of > the smaller node types will be freed but not enough to actually free blocks of > memory. If we instead have ~power-of-two sizes we could just use a single slab > of the max size, and carve out the smaller node types out of that largest > allocation. You meant to manage memory allocation (and free) for smaller node types by ourselves? How about using different block size for different node types? > > Btw, that fragmentation is another reason why I think it's better to track > memory usage via memory contexts, rather than doing so based on > GetMemoryChunkSpace(). Agreed. > > > > > Ideally, node16 and node32 would have the same code with a different > > > loop count (1 or 2). More generally, there is too much duplication of > > > code (noted by Andres in his PoC), and there are many variable names > > > with the node size embedded. This is a bit tricky to make more > > > general, so we don't need to try it yet, but ideally we would have > > > something similar to: > > > > > > switch (node->kind) // todo: inspect tagged pointer > > > { > > > case RADIX_TREE_NODE_KIND_4: > > > idx = node_search_eq(node, chunk, 4); > > > do_action(node, idx, 4, ...); > > > break; > > > case RADIX_TREE_NODE_KIND_32: > > > idx = node_search_eq(node, chunk, 32); > > > do_action(node, idx, 32, ...); > > > ... 
> > > } > > FWIW, that should be doable with an inline function, if you pass it the memory > to the "array" rather than the node directly. Not so sure it's a good idea to > do dispatch between node types / search methods inside the helper, as you > suggest below: > > > > > static pg_alwaysinline void > > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout) > > > { > > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD) > > > // do simple loop with (node_simple *) node; > > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD) > > > // do vectorized loop where available with (node_vec *) node; > > > ... > > > } Yeah, It's worth trying at some point. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote: > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote: > A datum value is convenient to represent both a pointer and a value so > I used it to avoid defining node types for inner and leaf nodes > separately. I'm not convinced that's a good goal. I think we're going to want to have different key and value types, and trying to unify leaf and inner nodes is going to make that impossible. Consider e.g. using it for something like a buffer mapping table - your key might be way too wide to fit it sensibly into 64bit. > Since a datum could be 4 bytes or 8 bytes depending it might not be good for > some platforms. Right - thats another good reason why it's problematic. A lot of key types aren't going to be 4/8 bytes dependent on 32/64bit, but either / or. > > > +void > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > > > +{ > > > + int shift; > > > + bool replaced; > > > + radix_tree_node *node; > > > + radix_tree_node *parent = tree->root; > > > + > > > + /* Empty tree, create the root */ > > > + if (!tree->root) > > > + radix_tree_new_root(tree, key, val); > > > + > > > + /* Extend the tree if necessary */ > > > + if (key > tree->max_val) > > > + radix_tree_extend(tree, key); > > > > FWIW, the reason I used separate functions for these in the prototype is that > > it turns out to generate a lot better code, because it allows non-inlined > > function calls to be sibling calls - thereby avoiding the need for a dedicated > > stack frame. That's not possible once you need a palloc or such, so splitting > > off those call paths into dedicated functions is useful. > > Thank you for the info. How much does using sibling call optimization > help the performance in this case? I think that these two cases are > used only a limited number of times: inserting the first key and > extending the tree. It's not that it helps in the cases moved into separate functions - it's that not having that code in the "normal" paths keeps the normal path faster. Greetings, Andres Freund
Hi, On 2022-07-05 16:33:29 +0900, Masahiko Sawada wrote: > > One thing I was wondering about is trying to choose node types in > > roughly-power-of-two struct sizes. It's pretty easy to end up with significant > > fragmentation in the slabs right now when inserting as you go, because some of > > the smaller node types will be freed but not enough to actually free blocks of > > memory. If we instead have ~power-of-two sizes we could just use a single slab > > of the max size, and carve out the smaller node types out of that largest > > allocation. > > You meant to manage memory allocation (and free) for smaller node > types by ourselves? For all of them basically. Using a single slab allocator and then subdividing the "common block size" into however many chunks that fit into a single node type. > How about using different block size for different node types? Not following... Greetings, Andres Freund
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Looking at the node stats, and then your benchmark code, I think key > > construction is a major influence, maybe more than node type. The > > key/value scheme tested now makes sense: > > > > blockhi || blocklo || 9 bits of item offset > > > > (with the leaf nodes containing a bit map of the lowest few bits of > > this whole thing) > > > > We want the lower fanout nodes at the top of the tree and higher > > fanout ones at the bottom. > > So more inner nodes can fit in CPU cache, right? My thinking is, on average, there will be more dense space utilization in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about cache, since with my idea a search might have to visit more nodes to get the common negative result (indexed tid not found in vacuum's list). > > Note some consequences: If the table has enough columns such that much > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the > > dense case the nodes above the leaves will have lower fanout (maybe > > they will fit in a node32). Also, the bitmap values in the leaves will > > be more empty. In other words, many tables in the wild *resemble* the > > sparse case a bit, even if truly all tuples on the page are dead. > > > > Note also that the dense case in the benchmark above has ~4500 times > > more keys than the sparse case, and uses about ~1000 times more > > memory. But the runtime is only 2-3 times longer. That's interesting > > to me. > > > > To optimize for the sparse case, it seems to me that the key/value would be > > > > blockhi || 9 bits of item offset || blocklo > > > > I believe that would make the leaf nodes more dense, with fewer inner > > nodes, and could drastically speed up the sparse case, and maybe many > > realistic dense cases. > > Does it have an effect on the number of inner nodes? > > > I'm curious to hear your thoughts. > > Thank you for your analysis. It's worth trying. We use 9 bits for item > offset but most pages don't use all bits in practice. So probably it > might be better to move the most significant bit of item offset to the > left of blockhi. Or more simply: > > 9 bits of item offset || blockhi || blocklo A concern here is most tids won't use many bits in blockhi either, most often far fewer, so this would make the tree higher, I think. Each value of blockhi represents 0.5GB of heap (32TB max). Even with very large tables I'm guessing most pages of interest to vacuum are concentrated in a few of these 0.5GB "segments". And it's possible path compression would change the tradeoffs here. -- John Naylor EDB: http://www.enterprisedb.com
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote: > > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote: > > A datum value is convenient to represent both a pointer and a value so > > I used it to avoid defining node types for inner and leaf nodes > > separately. > > I'm not convinced that's a good goal. I think we're going to want to have > different key and value types, and trying to unify leaf and inner nodes is > going to make that impossible. > > Consider e.g. using it for something like a buffer mapping table - your key > might be way too wide to fit it sensibly into 64bit. Right. It seems to be better to have an interface so that the user of the radix tree can specify the arbitrary key size (and perhaps value size too?) on creation. And we can have separate leaf node types that have values instead of pointers. If the value size is less than pointer size, we can have values within leaf nodes but if it’s bigger probably the leaf node can have pointers to memory where to store the value. > > > > Since a datum could be 4 bytes or 8 bytes depending it might not be good for > > some platforms. > > Right - thats another good reason why it's problematic. A lot of key types > aren't going to be 4/8 bytes dependent on 32/64bit, but either / or. > > > > > > +void > > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) > > > > +{ > > > > + int shift; > > > > + bool replaced; > > > > + radix_tree_node *node; > > > > + radix_tree_node *parent = tree->root; > > > > + > > > > + /* Empty tree, create the root */ > > > > + if (!tree->root) > > > > + radix_tree_new_root(tree, key, val); > > > > + > > > > + /* Extend the tree if necessary */ > > > > + if (key > tree->max_val) > > > > + radix_tree_extend(tree, key); > > > > > > FWIW, the reason I used separate functions for these in the prototype is that > > > it turns out to generate a lot better code, because it allows non-inlined > > > function calls to be sibling calls - thereby avoiding the need for a dedicated > > > stack frame. That's not possible once you need a palloc or such, so splitting > > > off those call paths into dedicated functions is useful. > > > > Thank you for the info. How much does using sibling call optimization > > help the performance in this case? I think that these two cases are > > used only a limited number of times: inserting the first key and > > extending the tree. > > It's not that it helps in the cases moved into separate functions - it's that > not having that code in the "normal" paths keeps the normal path faster. Thanks, understood. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 5:49 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Looking at the node stats, and then your benchmark code, I think key > > > construction is a major influence, maybe more than node type. The > > > key/value scheme tested now makes sense: > > > > > > blockhi || blocklo || 9 bits of item offset > > > > > > (with the leaf nodes containing a bit map of the lowest few bits of > > > this whole thing) > > > > > > We want the lower fanout nodes at the top of the tree and higher > > > fanout ones at the bottom. > > > > So more inner nodes can fit in CPU cache, right? > > My thinking is, on average, there will be more dense space utilization > in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about > cache, since with my idea a search might have to visit more nodes to > get the common negative result (indexed tid not found in vacuum's > list). > > > > Note some consequences: If the table has enough columns such that much > > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the > > > dense case the nodes above the leaves will have lower fanout (maybe > > > they will fit in a node32). Also, the bitmap values in the leaves will > > > be more empty. In other words, many tables in the wild *resemble* the > > > sparse case a bit, even if truly all tuples on the page are dead. > > > > > > Note also that the dense case in the benchmark above has ~4500 times > > > more keys than the sparse case, and uses about ~1000 times more > > > memory. But the runtime is only 2-3 times longer. That's interesting > > > to me. > > > > > > To optimize for the sparse case, it seems to me that the key/value would be > > > > > > blockhi || 9 bits of item offset || blocklo > > > > > > I believe that would make the leaf nodes more dense, with fewer inner > > > nodes, and could drastically speed up the sparse case, and maybe many > > > realistic dense cases. > > > > Does it have an effect on the number of inner nodes? > > > > > I'm curious to hear your thoughts. > > > > Thank you for your analysis. It's worth trying. We use 9 bits for item > > offset but most pages don't use all bits in practice. So probably it > > might be better to move the most significant bit of item offset to the > > left of blockhi. Or more simply: > > > > 9 bits of item offset || blockhi || blocklo > > A concern here is most tids won't use many bits in blockhi either, > most often far fewer, so this would make the tree higher, I think. > Each value of blockhi represents 0.5GB of heap (32TB max). Even with > very large tables I'm guessing most pages of interest to vacuum are > concentrated in a few of these 0.5GB "segments". Right. I guess that the tree height is affected by where garbages are, right? For example, even if all garbage in the table is concentrated in 0.5GB, if they exist between 2^17 and 2^18 block, we use the first byte of blockhi. If the table is larger than 128GB, the second byte of the blockhi could be used depending on where the garbage exists. Another variation of how to store TID would be that we use the block number as a key and store a bitmap of the offset as a value. We can use Bitmapset for example, or an approach like Roaring bitmap. I think that at this stage it's better to define the design first. For example, key size and value size, and these sizes are fixed or can be set the arbitary size? 
Given the use case of buffer mapping, we would need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On the other hand, limiting the key size to a 64-bit integer keeps the logic simple, and possibly it could still be used in buffer mapping cases by using a tree of trees. For the value size, if we support different value sizes specified by the user, we can either embed multiple values in the leaf node (called Multi-value leaves in the ART paper) or introduce a leaf node that stores one value (called Single-value leaves). > And it's possible path compression would change the tradeoffs here. Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
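A minimal sketch of the "block number as key, offset bitmap as value" variation mentioned above, using a fixed-size bitmap sized for MaxHeapTuplesPerPage (names and sizes are assumptions for illustration, not a concrete proposal):

#include <stdbool.h>
#include <stdint.h>

#define MAX_OFFSETS_PER_PAGE 291    /* MaxHeapTuplesPerPage with 8kB pages */

/* Bitmap of dead item offsets for a single heap block. */
typedef struct BlockOffsets
{
    uint8_t     bits[(MAX_OFFSETS_PER_PAGE + 7) / 8];
} BlockOffsets;

static inline void
block_offsets_set(BlockOffsets *bo, uint16_t offnum)
{
    bo->bits[offnum / 8] |= (uint8_t) (1 << (offnum % 8));
}

static inline bool
block_offsets_test(const BlockOffsets *bo, uint16_t offnum)
{
    return (bo->bits[offnum / 8] & (1 << (offnum % 8))) != 0;
}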
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I guess that the tree height is affected by where garbages are, right? > For example, even if all garbage in the table is concentrated in > 0.5GB, if they exist between 2^17 and 2^18 block, we use the first > byte of blockhi. If the table is larger than 128GB, the second byte of > the blockhi could be used depending on where the garbage exists. Right. > Another variation of how to store TID would be that we use the block > number as a key and store a bitmap of the offset as a value. We can > use Bitmapset for example, I like the idea of using existing code to set/check a bitmap if it's convenient. But (in case that was implied here) I'd really like to stay away from variable-length values, which would require "Single-value leaves" (slow). I also think it's fine to treat the key/value as just bits, and not care where exactly they came from, as we've been talking about. > or an approach like Roaring bitmap. This would require two new data structures instead of one. That doesn't seem like a path to success. > I think that at this stage it's better to define the design first. For > example, key size and value size, and these sizes are fixed or can be > set the arbitary size? I don't think we need to start over. Andres' prototype had certain design decisions built in for the intended use case (although maybe not clearly documented as such). Subsequent patches in this thread substantially changed many design aspects. If there were any changes that made things wonderful for vacuum, it wasn't explained, but Andres did explain how some of these changes were not good for other uses. Going to fixed 64-bit keys and values should still allow many future applications, so let's do that if there's no reason not to. > For value size, if we support > different value sizes specified by the user, we can either embed > multiple values in the leaf node (called Multi-value leaves in ART > paper) I don't think "Multi-value leaves" allow for variable-length values, FWIW. And now I see I also used this term wrong in my earlier review comment -- v3/4 don't actually use "multi-value leaves", but Andres' does (going by the multiple leaf types). From the paper: "Multi-value leaves: The values are stored in one of four different leaf node types, which mirror the structure of inner nodes, but contain values instead of pointers." (It seems v3/v4 could be called a variation of "Combined pointer/value slots: If values fit into pointers, no separate node types are necessary. Instead, each pointer storage location in an inner node can either store a pointer or a value." But without the advantage of variable length keys). -- John Naylor EDB: http://www.enterprisedb.com
On Fri, Jul 8, 2022 at 3:43 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I guess that the tree height is affected by where garbages are, right? > > For example, even if all garbage in the table is concentrated in > > 0.5GB, if they exist between 2^17 and 2^18 block, we use the first > > byte of blockhi. If the table is larger than 128GB, the second byte of > > the blockhi could be used depending on where the garbage exists. > > Right. > > > Another variation of how to store TID would be that we use the block > > number as a key and store a bitmap of the offset as a value. We can > > use Bitmapset for example, > > I like the idea of using existing code to set/check a bitmap if it's > convenient. But (in case that was implied here) I'd really like to > stay away from variable-length values, which would require > "Single-value leaves" (slow). I also think it's fine to treat the > key/value as just bits, and not care where exactly they came from, as > we've been talking about. > > > or an approach like Roaring bitmap. > > This would require two new data structures instead of one. That > doesn't seem like a path to success. Agreed. > > > I think that at this stage it's better to define the design first. For > > example, key size and value size, and these sizes are fixed or can be > > set the arbitary size? > > I don't think we need to start over. Andres' prototype had certain > design decisions built in for the intended use case (although maybe > not clearly documented as such). Subsequent patches in this thread > substantially changed many design aspects. If there were any changes > that made things wonderful for vacuum, it wasn't explained, but Andres > did explain how some of these changes were not good for other uses. > Going to fixed 64-bit keys and values should still allow many future > applications, so let's do that if there's no reason not to. I thought Andres pointed out that given that we store BufferTag (or part of that) into the key, the fixed 64-bit keys might not be enough for buffer mapping use cases. If we want to use wider keys more than 64-bit, we would need to consider it. > > > For value size, if we support > > different value sizes specified by the user, we can either embed > > multiple values in the leaf node (called Multi-value leaves in ART > > paper) > > I don't think "Multi-value leaves" allow for variable-length values, > FWIW. And now I see I also used this term wrong in my earlier review > comment -- v3/4 don't actually use "multi-value leaves", but Andres' > does (going by the multiple leaf types). From the paper: "Multi-value > leaves: The values are stored in one of four different leaf node > types, which mirror the structure of inner nodes, but contain values > instead of pointers." Right, but sorry I meant the user specifies the arbitrary fixed-size value length on creation like we do in dynahash.c. > > (It seems v3/v4 could be called a variation of "Combined pointer/value > slots: If values fit into pointers, no separate node types are > necessary. Instead, each pointer storage location in an inner node can > either store a pointer or a value." But without the advantage of > variable length keys). Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I think that at this stage it's better to define the design first. For > > > example, key size and value size, and these sizes are fixed or can be > > > set the arbitary size? > > > > I don't think we need to start over. Andres' prototype had certain > > design decisions built in for the intended use case (although maybe > > not clearly documented as such). Subsequent patches in this thread > > substantially changed many design aspects. If there were any changes > > that made things wonderful for vacuum, it wasn't explained, but Andres > > did explain how some of these changes were not good for other uses. > > Going to fixed 64-bit keys and values should still allow many future > > applications, so let's do that if there's no reason not to. > > I thought Andres pointed out that given that we store BufferTag (or > part of that) into the key, the fixed 64-bit keys might not be enough > for buffer mapping use cases. If we want to use wider keys more than > 64-bit, we would need to consider it. It sounds like you've answered your own question, then. If so, I'm curious what your current thinking is. If we *did* want to have maximum flexibility, then "single-value leaves" method would be the way to go, since it seems to be the easiest way to have variable-length both keys and values. I do have a concern that the extra pointer traversal would be a drag on performance, and also require lots of small memory allocations. If we happened to go that route, your idea upthread of using a bitmapset of item offsets in the leaves sounds like a good fit for that. I also have some concerns about also simultaneously trying to design for the use for buffer mappings. I certainly want to make this good for as many future uses as possible, and I'd really like to preserve any optimizations already fought for. However, to make concrete progress on the thread subject, I also don't think it's the most productive use of time to get tied up about the fine details of something that will not likely happen for several years at the earliest. -- John Naylor EDB: http://www.enterprisedb.com
On Thu, Jul 14, 2022 at 1:17 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I think that at this stage it's better to define the design first. For > > > > example, key size and value size, and these sizes are fixed or can be > > > > set the arbitary size? > > > > > > I don't think we need to start over. Andres' prototype had certain > > > design decisions built in for the intended use case (although maybe > > > not clearly documented as such). Subsequent patches in this thread > > > substantially changed many design aspects. If there were any changes > > > that made things wonderful for vacuum, it wasn't explained, but Andres > > > did explain how some of these changes were not good for other uses. > > > Going to fixed 64-bit keys and values should still allow many future > > > applications, so let's do that if there's no reason not to. > > > > I thought Andres pointed out that given that we store BufferTag (or > > part of that) into the key, the fixed 64-bit keys might not be enough > > for buffer mapping use cases. If we want to use wider keys more than > > 64-bit, we would need to consider it. > > It sounds like you've answered your own question, then. If so, I'm > curious what your current thinking is. > > If we *did* want to have maximum flexibility, then "single-value > leaves" method would be the way to go, since it seems to be the > easiest way to have variable-length both keys and values. I do have a > concern that the extra pointer traversal would be a drag on > performance, and also require lots of small memory allocations. Agreed. > I also have some concerns about also simultaneously trying to design > for the use for buffer mappings. I certainly want to make this good > for as many future uses as possible, and I'd really like to preserve > any optimizations already fought for. However, to make concrete > progress on the thread subject, I also don't think it's the most > productive use of time to get tied up about the fine details of > something that will not likely happen for several years at the > earliest. I’d like to keep the first version simple. We can improve it and add more optimizations later. Using radix tree for vacuum TID storage would still be a big win comparing to using a flat array, even without all these optimizations. In terms of single-value leaves method, I'm also concerned about an extra pointer traversal and extra memory allocation. It's most flexible but multi-value leaves method is also flexible enough for many use cases. Using the single-value method seems to be too much as the first step for me. Overall, using 64-bit keys and 64-bit values would be a reasonable choice for me as the first step . It can cover wider use cases including vacuum TID use cases. And possibly it can cover use cases by combining a hash table or using tree of tree, for example. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, On 2022-07-08 11:09:44 +0900, Masahiko Sawada wrote: > I think that at this stage it's better to define the design first. For > example, key size and value size, and these sizes are fixed or can be > set the arbitary size? Given the use case of buffer mapping, we would > need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On > the other hand, limiting the key size is 64 bit integer makes the > logic simple, and possibly it could still be used in buffer mapping > cases by using a tree of a tree. For value size, if we support > different value sizes specified by the user, we can either embed > multiple values in the leaf node (called Multi-value leaves in ART > paper) or introduce a leaf node that stores one value (called > Single-value leaves). FWIW, I think the best path forward would be to do something similar to the simplehash.h approach, so it can be customized to the specific user. Greetings, Andres Freund
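For readers unfamiliar with it, the simplehash.h approach means the data structure lives in a template header that each caller specializes with macros before including it. A rough sketch of what that could look like for a radix tree follows; the RT_* parameters and the header itself are hypothetical, only the simplehash.h convention is real:

/* in the file that wants a radix tree specialized for its use case */
#define RT_PREFIX       vacuum_tid
#define RT_SCOPE        static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE   uint64
#include "lib/radixtree.h"      /* hypothetical template header */

/*
 * The include would then emit vacuum_tid_create(), vacuum_tid_set(),
 * vacuum_tid_search() and so on, the same way simplehash.h emits
 * <prefix>_create(), <prefix>_insert(), <prefix>_lookup() for its users.
 */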
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think the best path forward would be to do something similar to the
> simplehash.h approach, so it can be customized to the specific user.
I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask for the first use case.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jul 18, 2022 at 9:10 PM John Naylor <john.naylor@enterprisedb.com> wrote: > On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote: > > FWIW, I think the best path forward would be to do something similar to the > > simplehash.h approach, so it can be customized to the specific user. > > I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask forthe first use case. I have a prototype patch that creates a read-only snapshot of the visibility map, and has vacuumlazy.c work off of that when determining with pages to skip. The patch also gets rid of the SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C, principally because it really cuts down on the number of scanned_pages that are scanned only because the VM bit is unset concurrently by DML. The window for this is very large when the table is large (and naturally takes a long time to scan), resulting in many more "dead but not yet removable" tuples being encountered than necessary. Which itself causes bogus information in the FSM -- information about the space that VACUUM could free from the page, which is often highly misleading. There are remaining questions about how to do this properly. Right now I'm just copying pages from the VM into local memory, right after OldestXmin is first acquired -- we "lock in" a snapshot of the VM at the earliest opportunity, which is what lazy_scan_skip() actually works off now. There needs to be some consideration given to the resource management aspects of this -- it needs to use memory sensibly, which the current prototype patch doesn't do at all. I'm probably going to seriously pursue this as a project soon, and will probably need some kind of data structure for the local copy. The raw pages are usually quite space inefficient, considering we only need an immutable snapshot of the VM. I wonder if it makes sense to use this as part of this project. It will be possible to know the exact heap pages that will become scanned_pages before scanning even one page with this design (perhaps with caveats about low memory conditions). It could also be very effective as a way of speeding up TID lookups in the reasonably common case where most scanned_pages don't have any LP_DEAD items -- just look it up in our local/materialized copy of the VM first. But even when LP_DEAD items are spread fairly evenly, it could still give us reliable information about the distribution of LP_DEAD items very early on. Maybe the two data structures could even be combined in some way? You can use more memory for the local copy of the VM if you know that you won't need the memory for dead_items. It's kinda the same problem, in a way. -- Peter Geoghegan
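As a sketch of the "check the VM copy first" idea, with the snapshot simplified to one bit per heap page (the real visibility map keeps two bits per page, all-visible and all-frozen; the names here are invented):

#include "postgres.h"
#include "storage/block.h"

static inline bool
vmsnap_page_is_all_visible(const uint8 *vm_copy, BlockNumber blkno)
{
    return (vm_copy[blkno / 8] & (1 << (blkno % 8))) != 0;
}

A lazy_tid_reaped()-style callback could return false immediately for blocks that were all-visible in the snapshot, since vacuum never collected dead TIDs from pages it skipped.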
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I’d like to keep the first version simple. We can improve it and add
> more optimizations later. Using radix tree for vacuum TID storage
> would still be a big win comparing to using a flat array, even without
> all these optimizations. In terms of single-value leaves method, I'm
> also concerned about an extra pointer traversal and extra memory
> allocation. It's most flexible but multi-value leaves method is also
> flexible enough for many use cases. Using the single-value method
> seems to be too much as the first step for me.
>
> Overall, using 64-bit keys and 64-bit values would be a reasonable
> choice for me as the first step . It can cover wider use cases
> including vacuum TID use cases. And possibly it can cover use cases by
> combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jul 19, 2022 at 1:30 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I’d like to keep the first version simple. We can improve it and add > > more optimizations later. Using radix tree for vacuum TID storage > > would still be a big win comparing to using a flat array, even without > > all these optimizations. In terms of single-value leaves method, I'm > > also concerned about an extra pointer traversal and extra memory > > allocation. It's most flexible but multi-value leaves method is also > > flexible enough for many use cases. Using the single-value method > > seems to be too much as the first step for me. > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > choice for me as the first step . It can cover wider use cases > > including vacuum TID use cases. And possibly it can cover use cases by > > combining a hash table or using tree of tree, for example. > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserveoptimization work already done, so +1 from me. Thanks. I've updated the patch. It now implements 64-bit keys, 64-bit values, and the multi-value leaves method. I've tried to remove duplicated codes but we might find a better way to do that. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I’d like to keep the first version simple. We can improve it and add > > > more optimizations later. Using radix tree for vacuum TID storage > > > would still be a big win comparing to using a flat array, even without > > > all these optimizations. In terms of single-value leaves method, I'm > > > also concerned about an extra pointer traversal and extra memory > > > allocation. It's most flexible but multi-value leaves method is also > > > flexible enough for many use cases. Using the single-value method > > > seems to be too much as the first step for me. > > > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > > choice for me as the first step . It can cover wider use cases > > > including vacuum TID use cases. And possibly it can cover use cases by > > > combining a hash table or using tree of tree, for example. > > > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserveoptimization work already done, so +1 from me. > > Thanks. > > I've updated the patch. It now implements 64-bit keys, 64-bit values, > and the multi-value leaves method. I've tried to remove duplicated > codes but we might find a better way to do that. > With the recent changes related to simd, I'm going to split the patch into at least two parts: introduce other simd optimized functions used by the radix tree and the radix tree implementation. Particularly we need two functions for radix tree: a function like pg_lfind32 but for 8 bits integers and return the index, and a function that returns the index of the first element that is >= key. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
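For reference, scalar-only sketches of those two helpers; the real versions would add a SIMD fast path like the pg_lfind8() family, and the exact signatures here are assumptions:

#include "postgres.h"

static inline int
pg_lsearch8(uint8 key, const uint8 *base, uint32 nelem)
{
    for (uint32 i = 0; i < nelem; i++)
    {
        if (base[i] == key)
            return i;
    }
    return -1;                  /* not found */
}

/* index of the first element >= key, assuming 'base' is sorted ascending */
static inline int
pg_lsearch8_ge(uint8 key, const uint8 *base, uint32 nelem)
{
    for (uint32 i = 0; i < nelem; i++)
    {
        if (base[i] >= key)
            return i;
    }
    return nelem;               /* no such element */
}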
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > I’d like to keep the first version simple. We can improve it and add > > > > more optimizations later. Using radix tree for vacuum TID storage > > > > would still be a big win comparing to using a flat array, even without > > > > all these optimizations. In terms of single-value leaves method, I'm > > > > also concerned about an extra pointer traversal and extra memory > > > > allocation. It's most flexible but multi-value leaves method is also > > > > flexible enough for many use cases. Using the single-value method > > > > seems to be too much as the first step for me. > > > > > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > > > choice for me as the first step . It can cover wider use cases > > > > including vacuum TID use cases. And possibly it can cover use cases by > > > > combining a hash table or using tree of tree, for example. > > > > > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserveoptimization work already done, so +1 from me. > > > > Thanks. > > > > I've updated the patch. It now implements 64-bit keys, 64-bit values, > > and the multi-value leaves method. I've tried to remove duplicated > > codes but we might find a better way to do that. > > > > With the recent changes related to simd, I'm going to split the patch > into at least two parts: introduce other simd optimized functions used > by the radix tree and the radix tree implementation. Particularly we > need two functions for radix tree: a function like pg_lfind32 but for > 8 bits integers and return the index, and a function that returns the > index of the first element that is >= key. I recommend looking at https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com since I did the work just now for searching bytes and returning a bool, buth = and <=. Should be pretty close. Also, i believe if you left this for last as a possible refactoring, it might save some work. In any case, I'll take a look at the latest patch next month. -- John Naylor EDB: http://www.enterprisedb.com
On Mon, Aug 15, 2022 at 10:39 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor > > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > > > > > > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > I’d like to keep the first version simple. We can improve it and add > > > > > more optimizations later. Using radix tree for vacuum TID storage > > > > > would still be a big win comparing to using a flat array, even without > > > > > all these optimizations. In terms of single-value leaves method, I'm > > > > > also concerned about an extra pointer traversal and extra memory > > > > > allocation. It's most flexible but multi-value leaves method is also > > > > > flexible enough for many use cases. Using the single-value method > > > > > seems to be too much as the first step for me. > > > > > > > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > > > > choice for me as the first step . It can cover wider use cases > > > > > including vacuum TID use cases. And possibly it can cover use cases by > > > > > combining a hash table or using tree of tree, for example. > > > > > > > > These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier topreserve optimization work already done, so +1 from me. > > > > > > Thanks. > > > > > > I've updated the patch. It now implements 64-bit keys, 64-bit values, > > > and the multi-value leaves method. I've tried to remove duplicated > > > codes but we might find a better way to do that. > > > > > > > With the recent changes related to simd, I'm going to split the patch > > into at least two parts: introduce other simd optimized functions used > > by the radix tree and the radix tree implementation. Particularly we > > need two functions for radix tree: a function like pg_lfind32 but for > > 8 bits integers and return the index, and a function that returns the > > index of the first element that is >= key. > > I recommend looking at > > https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com > > since I did the work just now for searching bytes and returning a > bool, buth = and <=. Should be pretty close. Also, i believe if you > left this for last as a possible refactoring, it might save some work. > In any case, I'll take a look at the latest patch next month. I've updated the radix tree patch. It's now separated into two patches. 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find better names) that are similar to the pg_lfind8() family but they return the index of the key in the vector instead of true/false. The patch includes regression tests. 0002 patch is the main radix tree implementation. I've removed some duplicated codes of node manipulation. For instance, since node-4, node-16, and node-32 have a similar structure with different fanouts, I introduced the common function for them. In addition to two patches, I've attached the third patch. It's not part of radix tree implementation but introduces a contrib module bench_radix_tree, a tool for radix tree performance benchmarking. It measures loading and lookup performance of both the radix tree and a flat array. 
Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Aug 15, 2022 at 10:39 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > bool, buth = and <=. Should be pretty close. Also, i believe if you > > left this for last as a possible refactoring, it might save some work. v6 demonstrates why this should have been put off towards the end. (more below) > > In any case, I'll take a look at the latest patch next month. Since the CF entry said "Needs Review", I began looking at v5 again this week. Hopefully not too much has changed, but in the future I strongly recommend setting to "Waiting on Author" if a new version is forthcoming. I realize many here share updated patches at any time, but I'd like to discourage the practice especially for large patches. > I've updated the radix tree patch. It's now separated into two patches. > > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find > better names) that are similar to the pg_lfind8() family but they > return the index of the key in the vector instead of true/false. The > patch includes regression tests. I don't want to do a full review of this just yet, but I'll just point out some problems from a quick glance. +/* + * Return the index of the first element in the vector that is greater than + * or eual to the given scalar. Return sizeof(Vector8) if there is no such + * element. That's a bizarre API to indicate non-existence. + * + * Note that this function assumes the elements in the vector are sorted. + */ That is *completely* unacceptable for a general-purpose function. +#else /* USE_NO_SIMD */ + Vector8 r = 0; + uint8 *rp = (uint8 *) &r; + + for (Size i = 0; i < sizeof(Vector8); i++) + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0; I don't think we should try to force the non-simd case to adopt the special semantics of vector comparisons. It's much easier to just use the same logic as the assert builds. +#ifdef USE_SSE2 + return (uint32) _mm_movemask_epi8(v); +#elif defined(USE_NEON) + static const uint8 mask[16] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + }; + + uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); + uint8x16_t maskedhi = vextq_u8(masked, masked, 8); + + return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi)); For Arm, we need to be careful here. This article goes into a lot of detail for this situation: https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon Here again, I'd rather put this off and focus on getting the "large details" in good enough shape so we can got towards integrating with vacuum. > In addition to two patches, I've attached the third patch. It's not > part of radix tree implementation but introduces a contrib module > bench_radix_tree, a tool for radix tree performance benchmarking. It > measures loading and lookup performance of both the radix tree and a > flat array. Excellent! This was high on my wish list. -- John Naylor EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote: > Here again, I'd rather put this off and focus on getting the "large > details" in good enough shape so we can got towards integrating with > vacuum. I started a new thread for the SIMD patch [0] so that this thread can remain focused on the radix tree stuff. [0] https://www.postgresql.org/message-id/20220917052903.GA3172400%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Sep 16, 2022 at 4:54 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Aug 15, 2022 at 10:39 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > bool, buth = and <=. Should be pretty close. Also, i believe if you > > > left this for last as a possible refactoring, it might save some work. > > v6 demonstrates why this should have been put off towards the end. (more below) > > > > In any case, I'll take a look at the latest patch next month. > > Since the CF entry said "Needs Review", I began looking at v5 again > this week. Hopefully not too much has changed, but in the future I > strongly recommend setting to "Waiting on Author" if a new version is > forthcoming. I realize many here share updated patches at any time, > but I'd like to discourage the practice especially for large patches. Understood. Sorry for the inconveniences. > > > I've updated the radix tree patch. It's now separated into two patches. > > > > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find > > better names) that are similar to the pg_lfind8() family but they > > return the index of the key in the vector instead of true/false. The > > patch includes regression tests. > > I don't want to do a full review of this just yet, but I'll just point > out some problems from a quick glance. > > +/* > + * Return the index of the first element in the vector that is greater than > + * or eual to the given scalar. Return sizeof(Vector8) if there is no such > + * element. > > That's a bizarre API to indicate non-existence. > > + * > + * Note that this function assumes the elements in the vector are sorted. > + */ > > That is *completely* unacceptable for a general-purpose function. > > +#else /* USE_NO_SIMD */ > + Vector8 r = 0; > + uint8 *rp = (uint8 *) &r; > + > + for (Size i = 0; i < sizeof(Vector8); i++) > + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0; > > I don't think we should try to force the non-simd case to adopt the > special semantics of vector comparisons. It's much easier to just use > the same logic as the assert builds. > > +#ifdef USE_SSE2 > + return (uint32) _mm_movemask_epi8(v); > +#elif defined(USE_NEON) > + static const uint8 mask[16] = { > + 1 << 0, 1 << 1, 1 << 2, 1 << 3, > + 1 << 4, 1 << 5, 1 << 6, 1 << 7, > + 1 << 0, 1 << 1, 1 << 2, 1 << 3, > + 1 << 4, 1 << 5, 1 << 6, 1 << 7, > + }; > + > + uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) > vshrq_n_s8(v, 7)); > + uint8x16_t maskedhi = vextq_u8(masked, masked, 8); > + > + return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi)); > > For Arm, we need to be careful here. This article goes into a lot of > detail for this situation: > > https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon > > Here again, I'd rather put this off and focus on getting the "large > details" in good enough shape so we can got towards integrating with > vacuum. Thank you for the comments! These above comments are addressed by Nathan in a newly derived thread. I'll work on the patch. I'll consider how to integrate with vacuum as the next step. One concern for me is how to limit the memory usage to maintenance_work_mem. Unlike using a flat array, memory space for adding one TID varies depending on the situation. 
If we want strictly not to allow using memory more than maintenance_work_mem, probably we need to estimate the memory consumption in a conservative way. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
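One conservative way to do the estimate just mentioned would be to look at the bytes actually allocated under the tree's memory context and leave headroom for the worst-case growth a single insertion could trigger (a new node at every level). A sketch, with MAX_NODE_SIZE and all names purely illustrative:

#include "postgres.h"
#include "utils/memutils.h"

#define MAX_NODE_SIZE   2088    /* illustrative: size of the largest node kind */

static bool
dead_items_have_space(MemoryContext tree_context, int tree_height, Size limit)
{
    Size    used = MemoryContextMemAllocated(tree_context, true);
    Size    worst_case_insert = (tree_height + 1) * MAX_NODE_SIZE;

    return used + worst_case_insert <= limit;
}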
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.
I still seem to be out-voted on when to tackle this particular optimization, so I've extended the v6 benchmark code with a hackish function that populates a fixed number of keys, but with different fanouts. (diff attached as a text file)
I didn't take particular care to make this scientific, but the following seems pretty reproducible. Note what happens to load and search performance when node16 has 15 entries versus 16:
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
15 | 327680 | 3776512 | 39 | 20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
16 | 327680 | 3514368 | 25 | 11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0
In trying to wrap the SIMD code behind layers of abstraction, the latest patch (and Nathan's cleanup) threw it away in almost all cases. To explain, we need to talk about how vectorized code deals with the "tail" that is too small for the register:
1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the result.
There are advantages to both depending on the situation.
Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 elements or less, it will iterate over them one-by-one exactly like a node4. Only when full with 16 will the vector path be taken. When another entry is added, the elements are copied to the next bigger node, so there's a *small* window where it's fast.
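For illustration, a low-level sketch of what approach #2 can look like for a node16-style search, written directly against SSE2 (the function name is invented and a real version would need the usual portability wrappers): we always load a full 16-byte register, so the bytes past the valid count are junk but still inside the allocation, and any matches among them are masked off afterwards.

#include "postgres.h"
#include "port/pg_bitutils.h"
#include <emmintrin.h>

static inline int
node16_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
    __m128i spread = _mm_set1_epi8((char) chunk);
    __m128i haystack = _mm_loadu_si128((const __m128i *) chunks);
    uint32  bitfield = _mm_movemask_epi8(_mm_cmpeq_epi8(spread, haystack));

    /* discard any "matches" found in the junk tail */
    bitfield &= (1u << count) - 1;

    return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}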
In short, this code needs to be lower level so that we still have full control while being portable. I will work on this, and also the related code for node dispatch.
Since v6 has some good infrastructure to do low-level benchmarking, I also want to do some experiments with memory management.
(I have further comments about the code, but I will put that off until later)
> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.
+1
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote: > In trying to wrap the SIMD code behind layers of abstraction, the latest > patch (and Nathan's cleanup) threw it away in almost all cases. To explain, > we need to talk about how vectorized code deals with the "tail" that is too > small for the register: > > 1. Use a one-by-one algorithm, like we do for the pg_lfind* variants. > 2. Read some junk into the register and mask off false positives from the > result. > > There are advantages to both depending on the situation. > > Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 > elements or less, it will iterate over them one-by-one exactly like a > node4. Only when full with 16 will the vector path be taken. When another > entry is added, the elements are copied to the next bigger node, so there's > a *small* window where it's fast. > > In short, this code needs to be lower level so that we still have full > control while being portable. I will work on this, and also the related > code for node dispatch. Is it possible to use approach #2 here, too? AFAICT space is allocated for all of the chunks, so there wouldn't be any danger in searching all them and discarding any results >= node->count. Granted, we're depending on the number of chunks always being a multiple of elements-per-vector in order to avoid the tail path, but that seems like a reasonably safe assumption that can be covered with comments. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
>
> > In short, this code needs to be lower level so that we still have full
> > control while being portable. I will work on this, and also the related
> > code for node dispatch.
>
> Is it possible to use approach #2 here, too? AFAICT space is allocated for
> all of the chunks, so there wouldn't be any danger in searching all them
> and discarding any results >= node->count.
Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of the node count.
> Granted, we're depending on the
> number of chunks always being a multiple of elements-per-vector in order to
> avoid the tail path, but that seems like a reasonably safe assumption that
> can be covered with comments.
Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
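As a concrete picture of that cost, a sketch of a sorted insert into the parallel chunk/slot arrays (names invented; the insert position would come from a pg_lfind_ge-style search):

#include "postgres.h"
#include <string.h>

static void
node_insert_sorted(uint8 *chunks, uint64 *slots, int count,
                   int insertpos, uint8 chunk, uint64 slot)
{
    /* shift the tail up by one in both arrays to keep them in sync */
    memmove(&chunks[insertpos + 1], &chunks[insertpos],
            (count - insertpos) * sizeof(uint8));
    memmove(&slots[insertpos + 1], &slots[insertpos],
            (count - insertpos) * sizeof(uint64));

    chunks[insertpos] = chunk;
    slots[insertpos] = slot;
}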
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 1:46 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote: > > > > > In short, this code needs to be lower level so that we still have full > > > control while being portable. I will work on this, and also the related > > > code for node dispatch. > > > > Is it possible to use approach #2 here, too? AFAICT space is allocated for > > all of the chunks, so there wouldn't be any danger in searching all them > > and discarding any results >= node->count. > > Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of thenode count. > > > Granted, we're depending on the > > number of chunks always being a multiple of elements-per-vector in order to > > avoid the tail path, but that seems like a reasonably safe assumption that > > can be covered with comments. > > Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're notreading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case,the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky).One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably insidea power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which wouldbe faster and possibly would solve the fragmentation problem Andres referred to in > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de > > While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branchesand memmove calls, and is the whole reason for the recent "pg_lfind_ge" function. Good point. While keeping the chunks in the small nodes in sorted order is useful for visiting all keys in sorted order, additional branches and memmove calls could be slow. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Sep 22, 2022 at 1:46 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
>
> Good point. While keeping the chunks in the small nodes in sorted
> order is useful for visiting all keys in sorted order, additional
> branches and memmove calls could be slow.
Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 7:52 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Good point. While keeping the chunks in the small nodes in sorted
> > order is useful for visiting all keys in sorted order, additional
> > branches and memmove calls could be slow.
>
> Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think.
Nevermind, I must have been mixing up keys and values there...
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
> https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
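For concreteness, a sketch of what that 4-byte base could look like; the struct name and exact bit widths are only one possibility:

typedef struct rt_node_base
{
    uint16      kind:3,     /* enough for the five v6 node kinds */
                count:13;   /* covers node256's maximum of 256 entries */
    uint8       shift;      /* key bits consumed by levels below this node */
    uint8       chunk;      /* this node's key byte within its parent */
} rt_node_base;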
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 23, 2022 at 12:11 AM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote: > > One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably insidea power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which wouldbe faster and possibly would solve the fragmentation problem Andres referred to in > > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de > > While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, takingup 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here inthe struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind: > > uint16 -- kind and count bitfield > uint8 shift; > uint8 chunk; > > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only. Good point, agreed. > > Here are the v6 node kinds: > > node4: 8 + 4 +(4) + 4*8 = 48 bytes > node16: 8 + 16 + 16*8 = 152 > node32: 8 + 32 + 32*8 = 296 > node128: 8 + 256 + 128/8 + 128*8 = 1304 > node256: 8 + 256/8 + 256*8 = 2088 > > And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes.Even if my math has mistakes, the numbers shouldn't be too far off: > > node3: 4 + 3 +(1) + 3*8 = 32 bytes > node6: 4 + 6 +(6) + 6*8 = 64 > node13: 4 + 13 +(7) + 13*8 = 128 > node28: 4 + 28 + 28*8 = 256 > node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good) > node94: 4 + 256 + 96/8 + 94*8 = 1024 > node220: 4 + 256 + 224/8 + 220*8 = 2048 > node256: = 4096 > > The main disadvantage is that node256 would balloon in size. Yeah, node31 and node256 are bloated. We probably could use slab for node256 independently. It's worth trying a benchmark to see how it affects the performance and the tree size. BTW We need to consider not only aset/slab but also DSA since we allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses the following size classes: static const uint16 dsa_size_classes[] = { sizeof(dsa_area_span), 0, /* special size classes */ 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */ 80, 96, 112, 128, /* 4 classes separated by 16 bytes */ 160, 192, 224, 256, /* 4 classes separated by 32 bytes */ 320, 384, 448, 512, /* 4 classes separated by 64 bytes */ 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */ 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */ 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */ 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */ }; node256 will be classed as 2616, which is still not good. Anyway, I'll implement DSA support for radix tree. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
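As a rough sketch of the slab idea for the local (non-DSA) case, the fixed-size node256 could come from its own slab context while the variable-sized kinds stay in aset or DSA; the size constant and names below are illustrative:

#include "postgres.h"
#include "utils/memutils.h"

#define RT_NODE_256_SIZE    2088    /* v6 layout; illustrative */

static MemoryContext
rt_create_node256_context(MemoryContext parent)
{
    /* all chunks in a slab context have the same size, so no rounding waste */
    return SlabContextCreate(parent,
                             "radix tree node256",
                             SLAB_DEFAULT_BLOCK_SIZE,
                             RT_NODE_256_SIZE);
}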
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.
Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size. To show what I mean, consider this new table:
node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64
node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)
node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048
node256: 5 + 32 +(3)+256*8 = 2088(->4096)
Seven size classes are grouped into the four kinds.
The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:
case FOO_KIND:
if (unlikely(count == capacity))
{
if (capacity == XYZ) /* for smaller size class of the pair */
{
<repalloc to next size class>;
capacity = next-higher-capacity;
goto do_insert;
}
else
<migrate data to next node kind>;
}
else
{
do_insert:
<...>;
break;
}
/* FALLTHROUGH */
...
One disadvantage is that this wastes some space by reserving the full set of control data in the smaller size class of the pair, but it's usually small compared to array size. Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.
[1] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
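To make the pairing concrete, a sketch of how the node12/node27 pair from the table above could share one struct, with the value array as a flexible array member so that growing within the pair is just a repalloc() to the larger size (struct and macro names invented):

typedef struct rt_node_array
{
    uint16      kind:2,     /* four kinds */
                count:14;
    uint8       capacity;   /* new field: slots allocated for this node */
    uint8       shift;
    uint8       chunk;
    uint8       chunks[27]; /* sized for the larger class in the pair */
    uint64      values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_array;

/* one struct, two allocation sizes */
#define NODE12_SIZE     (offsetof(rt_node_array, values) + 12 * sizeof(uint64))
#define NODE27_SIZE     (offsetof(rt_node_array, values) + 27 * sizeof(uint64))

A full node12 would be repalloc'd to NODE27_SIZE and have its capacity bumped, without changing the kind or the search code.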
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 1:18 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> [stuff about size classes]
I kind of buried the lede here on one thing: If we only have 4 kinds regardless of the number of size classes, we can use 2 bits of the pointer for dispatch, which would only require 4-byte alignment. That should make that technique more portable.
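To sketch what that 2-bit dispatch could look like (helper names are invented; this assumes only that node allocations are at least 4-byte aligned, so the low two bits of a pointer are always zero):

#include <stdint.h>

#define RT_PTR_KIND_MASK   ((uintptr_t) 0x3)   /* low 2 bits carry the node kind */

static inline void *
rt_ptr_tag(void *node, unsigned int kind)
{
    return (void *) ((uintptr_t) node | (uintptr_t) kind);
}

static inline unsigned int
rt_ptr_kind(void *tagged)
{
    return (unsigned int) ((uintptr_t) tagged & RT_PTR_KIND_MASK);
}

static inline void *
rt_ptr_untag(void *tagged)
{
    return (void *) ((uintptr_t) tagged & ~RT_PTR_KIND_MASK);
}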
On Wed, Sep 28, 2022 at 3:18 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > BTW We need to consider not only aset/slab but also DSA since we
> > allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> > the following size classes:
> >
> > static const uint16 dsa_size_classes[] = {
> > [...]
>
> Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.

Thanks!

> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

Interesting idea. Using flexible array members for values would also be good for the case where we want to support value types other than uint64 in the future. With this idea, we can just repalloc() to grow to the larger size in a pair, but I'm slightly concerned that the more size classes we use, the more frequently nodes need to grow. If we want to support node shrinking, deletion is also affected.

> To show what I mean, consider this new table:
>
> node2: 5 + 6 +(5)+ 2*8 = 32 bytes
> node6: 5 + 6 +(5)+ 6*8 = 64
>
> node12: 5 + 27 + 12*8 = 128
> node27: 5 + 27 + 27*8 = 248(->256)
>
> node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
> node219: 5 + 256 + 28 +(7)+219*8 = 2048
>
> node256: 5 + 32 +(3)+256*8 = 2088(->4096)
>
> Seven size classes are grouped into the four kinds.
>
> The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:

I think we can use a bitfield for capacity. That way, we can pack count (9 bits), kind (2 bits), and capacity (4 bits) into a uint16.

> Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.

Right. I didn't do that in order to use the common logic for inner node128 and leaf node128.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
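A sketch of the uint16 packing suggested above using plain masks (macro names are invented; uint16 is PostgreSQL's c.h typedef, and the 4 capacity bits presumably hold a size-class number rather than a raw slot count):

#define RT_META_COUNT_MASK      0x01FF          /* bits 0-8: count, 0..256 */
#define RT_META_KIND_SHIFT      9               /* bits 9-10: node kind */
#define RT_META_KIND_MASK       0x03
#define RT_META_CAPACITY_SHIFT  11              /* bits 11-14: size class */
#define RT_META_CAPACITY_MASK   0x0F

static inline int
rt_meta_count(uint16 meta)
{
    return meta & RT_META_COUNT_MASK;
}

static inline int
rt_meta_kind(uint16 meta)
{
    return (meta >> RT_META_KIND_SHIFT) & RT_META_KIND_MASK;
}

static inline int
rt_meta_capacity(uint16 meta)
{
    return (meta >> RT_META_CAPACITY_SHIFT) & RT_META_CAPACITY_MASK;
}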
Hi,

On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> I've updated the radix tree patch. It's now separated into two patches.

cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446

[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
[11:03:05.343] | ^~~~~~~~~~~~~~~~~~

Greetings,

Andres Freund
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> > I've updated the radix tree patch. It's now separated into two patches.
>
> cfbot notices a compiler warning:
> https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
>
> [11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
> [11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> [11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
> [11:03:05.343] | ^~~~~~~~~~~~~~~~~~

Thanks, I'll fix it in the next version patch.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
> > >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> > Here are the v6 node kinds:
> >
> > node4: 8 + 4 +(4) + 4*8 = 48 bytes
> > node16: 8 + 16 + 16*8 = 152
> > node32: 8 + 32 + 32*8 = 296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8 + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
> >
> > node3: 4 + 3 +(1) + 3*8 = 32 bytes
> > node6: 4 + 6 +(6) + 6*8 = 64
> > node13: 4 + 13 +(7) + 13*8 = 128
> > node28: 4 + 28 + 28*8 = 256
> > node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
> > node94: 4 + 256 + 96/8 + 94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256: = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated. We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> sizeof(dsa_area_span), 0, /* special size classes */
> 8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
> 80, 96, 112, 128, /* 4 classes separated by 16 bytes */
> 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
> 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
> 640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
> 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
> 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
> 5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes to point to its child nodes, instead of C pointers (i.e., backend-local addresses). I'm thinking of a straightforward approach as the first step; inner nodes have a union of rt_node* and dsa_pointer and we choose either one based on whether the radix tree is shared or not. We allocate and free the shared memory for individual nodes with dsa_allocate() and dsa_free(), respectively. Therefore we need to get a C pointer from a dsa_pointer using dsa_get_address() while descending the tree. I'm a bit concerned that calling dsa_get_address() for every descent could be a performance overhead, but I'm going to measure it anyway.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
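A minimal sketch of that first-step approach (the radix_tree fields and the helper are hypothetical; dsa_pointer and dsa_get_address() are the existing DSA API):

/* Hypothetical sketch: a child slot that works for both local and shared trees. */
typedef union rt_ptr
{
    struct rt_node *local;      /* backend-local tree */
    dsa_pointer     shared;     /* tree living in a DSA area */
} rt_ptr;

/* Resolve a child slot to a usable C pointer while descending. */
static inline struct rt_node *
rt_ptr_to_node(radix_tree *tree, rt_ptr child)
{
    if (tree->shared)           /* assumed flag on the (hypothetical) tree struct */
        return (struct rt_node *) dsa_get_address(tree->dsa, child.shared);
    else
        return child.local;
}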
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > Yeah, node31 and node256 are bloated. We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.
This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.
> Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> to point to its child nodes, instead of C pointers (ig, backend-local
> address). I'm thinking of a straightforward approach as the first
> step; inner nodes have a union of rt_node* and dsa_pointer and we
> choose either one based on whether the radix tree is shared or not. We
> allocate and free the shared memory for individual nodes by
> dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> a C pointer from dsa_pointer by using dsa_get_address() while
> descending the tree. I'm a bit concerned that calling
> dsa_get_address() for every descent could be performance overhead but
> I'm going to measure it anyway.
Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, by the same way it would hide the latency of loading parts of a node into CPU registers.
One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com
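If it does come to templating, the existing precedent is the simplehash.h style, where the same implementation header is included once per variant. A rough sketch of how that pattern could be applied here (the file and macro names are invented):

/* backend-local variant */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#include "lib/radixtree_impl.h"     /* hypothetical template header */

/* DSA-backed variant, compiled from the same source */
#define RT_PREFIX shared_rt
#define RT_SCOPE static
#define RT_SHMEM                    /* switch child slots to dsa_pointer, etc. */
#include "lib/radixtree_impl.h"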
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > Yeah, node31 and node256 are bloated. We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.
>
> This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.

It would be good for performance, but I'm a bit concerned that it's highly optimized to the design of aset and DSA. Since size 2088 will currently be classed as 2616 in DSA, DSA wastes 528 bytes. However, if we introduce a new class of 2304 (= 2048 + 256) bytes, we cannot store the useless 256 bytes and the assumption will be broken.

> > Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> > to point to its child nodes, instead of C pointers (i.e., backend-local
> > addresses). I'm thinking of a straightforward approach as the first
> > step; inner nodes have a union of rt_node* and dsa_pointer and we
> > choose either one based on whether the radix tree is shared or not. We
> > allocate and free the shared memory for individual nodes by
> > dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> > a C pointer from dsa_pointer by using dsa_get_address() while
> > descending the tree. I'm a bit concerned that calling
> > dsa_get_address() for every descent could be performance overhead but
> > I'm going to measure it anyway.
>
> Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)?

I think so.

> It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, by the same way it would hide the latency of loading parts of a node into CPU registers.
>
> One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit.

Right. We also need to support locking for the shared radix tree, which would require more branches.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.
>
> It would be good for performance but I'm a bit concerned that it's
> highly optimized to the design of aset and DSA. Since size 2088 will
> be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
> we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
> useless 256-byte and the assumption will be broken.
A new DSA class is hypothetical. A better argument against my idea is that SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just now and the slab block sizes are:
Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)
...which would be 128kB for nodemax. I'm curious about the difference.
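Plugging in 2088 bytes (the v6 node256 size, presumably close to what the prototype's largest class uses) shows where the 128kB comes from, assuming 8-byte MAXALIGN:

  MAXALIGN(2088)           = 2088
  (2088 + 16) * 32         = 67328
  pg_nextpower2_32(67328)  = 131072  (= 128kB)
  Max(131072, 1024)        = 131072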
> > One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit.
>
> Right. We also need to support locking for shared radix tree, which
> would require more branches.
Hmm, now it seems we'll likely want to template local vs. shared as a later step...
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.
Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete.
For now, I have two questions that I can't figure out on my own:
1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below:
john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8000000 | 268435456 | 48000000 | 661 | 29 | 276 | 389
john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8388608 | 276824064 | 54000000 | 718 | 33 | 311 | 446
The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.)
2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? Example:
john=# select * from bench_seq_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 106 | 827 | 3348
john=# select * from bench_shuffle_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 171 | 107 | 827 | 1400
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Fri, Oct 7, 2022 at 2:29 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > In addition to two patches, I've attached the third patch. It's not
> > part of radix tree implementation but introduces a contrib module
> > bench_radix_tree, a tool for radix tree performance benchmarking. It
> > measures loading and lookup performance of both the radix tree and a
> > flat array.
>
> Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete.

Thanks!

> For now, I have two questions that I can't figure out on my own:
>
> 1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below:
>
> john=# select * from bench_shuffle_search(0, 8*1000*1000);
> NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 8000000 | 268435456 | 48000000 | 661 | 29 | 276 | 389
>
> john=# select * from bench_shuffle_search(0, 9*1000*1000);
> NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 8388608 | 276824064 | 54000000 | 718 | 33 | 311 | 446
>
> The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.)

Yes, I can reproduce this. In tid_to_key_off() we need to cast to uint64 when packing the offset number and block number:

    tid_i = ItemPointerGetOffsetNumber(tid);
    tid_i |= ItemPointerGetBlockNumber(tid) << shift;

> 2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? Example:
>
> john=# select * from bench_seq_search(0, 1000000);
> NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 1000000 | 10199040 | 180000000 | 168 | 106 | 827 | 3348
>
> john=# select * from bench_shuffle_search(0, 1000000);
> NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 1000000 | 10199040 | 180000000 | 171 | 107 | 827 | 1400

Ugh, in shuffle_itemptrs(), we shuffled itemptrs instead of itemptr:

    for (int i = 0; i < nitems - 1; i++)
    {
        int j = shuffle_randrange(&state, i, nitems - 1);
        ItemPointerData t = itemptrs[j];

        itemptrs[j] = itemptrs[i];
        itemptrs[i] = t;
    }

With the fix, the results on my environment were:

postgres(1:4093192)=# select * from bench_seq_search(0, 10000000);
2022-10-07 16:57:03.124 JST [4093192] LOG: num_keys = 10000000, height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
 10000000 | 101826560 | 1800000000 | 846 | 486 | 6096 | 21128
(1 row)

Time: 28975.566 ms (00:28.976)

postgres(1:4093192)=# select * from bench_shuffle_search(0, 10000000);
2022-10-07 16:57:37.476 JST [4093192] LOG: num_keys = 10000000, height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
 10000000 | 101826560 | 1800000000 | 845 | 484 | 32700 | 152583
(1 row)

I've attached a patch to fix them. Also, I realized that bsearch() could be optimized out, so I added code to prevent it.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
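Presumably the corrected packing described above (the cast is the point; the surrounding context is in the attached patch) is simply:

    tid_i = ItemPointerGetOffsetNumber(tid);
    /* promote to 64 bits before shifting so high block numbers are not truncated */
    tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;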
The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function calls
Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
0001-0003 are your performance test fix and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 0
1,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all branches
john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 0
1,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches
0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 907 | 0
1,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 0
1,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches
0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 | 0 | 867 | 0
1,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 0
1,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches
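For readers following along, the shape of such a branch-free lookup can be illustrated with raw SSE2 intrinsics (an illustration only; the 0005 patch keeps it portable, and the helper below is a sketch that also assumes pg_rightmost_one_pos32() from port/pg_bitutils.h):

#include <emmintrin.h>      /* SSE2 */

/*
 * Illustrative only: find 'chunk' among the first 'count' entries of a
 * 16-byte chunk array without a per-element branch, returning its index
 * or -1 if it is not present.
 */
static inline int
node16_search_chunk(const uint8 *chunks, int count, uint8 chunk)
{
    __m128i     spread = _mm_set1_epi8((char) chunk);
    __m128i     haystack = _mm_loadu_si128((const __m128i *) chunks);
    __m128i     cmp = _mm_cmpeq_epi8(spread, haystack);
    uint32      bitfield = (uint32) _mm_movemask_epi8(cmp);

    /* ignore any slots beyond the number of entries actually in use */
    bitfield &= ((uint32) 1 << count) - 1;

    if (bitfield == 0)
        return -1;
    return pg_rightmost_one_pos32(bitfield);
}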
0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 | 0 | 717 | 0
1,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 0
1,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all branches
Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. The abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
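To spell out the always-inline idea with a toy example (types and names are invented; uint8/uint64 as in c.h, and in PostgreSQL the helper could additionally be marked pg_attribute_always_inline): when the helper is inlined at a call site that passes a literal kind, the compiler can fold the switch away and the caller gets straight-line code.

typedef enum { KIND_4, KIND_32 } node_kind;     /* toy kinds for illustration */

typedef struct node4  { uint8 count; uint8 chunks[4];  uint64 values[4];  } node4;
typedef struct node32 { uint8 count; uint8 chunks[32]; uint64 values[32]; } node32;

static inline uint64 *
node_get_values(void *node, const node_kind kind)
{
    switch (kind)
    {
        case KIND_4:
            return ((node4 *) node)->values;
        case KIND_32:
            return ((node32 *) node)->values;
    }
    return NULL;                /* unreachable; keeps the compiler quiet */
}

/* A caller that already knows its kind pays no runtime dispatch here. */
static void
node4_set_value(node4 *n, int idx, uint64 value)
{
    node_get_values(n, KIND_4)[idx] = value;
}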
Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search
I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.
Regarding size class decoupling, I'll respond to a point made earlier:
On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With this idea, we can just repalloc() to grow to the larger size in a
> pair but I'm slightly concerned that the more size class we use, the
> more frequent the node needs to grow.
Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade off between memory usage and complexity.
> If we want to support node
> shrink, the deletion is also affected.
Not necessarily. We don't have to shrink at the same granularity as growing. My evidence is simple: we don't shrink at all now. :-)
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Oct 10, 2022 at 12:16 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6)
Forgot the patchset...
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
Hi,

On Mon, Oct 10, 2022 at 2:16 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
>
> My main concerns are that internal APIs:
>
> 1. are difficult to follow
> 2. lead to poor branch prediction and too many function calls
>
> Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
>
> On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > [fixed benchmarks]
>
> Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
>
> 0001-0003 are your performance test fix and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
>
> Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
>
> john=# select * from bench_seq_search(0, 1*1000*1000);
> NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 0
>
> 1,470,141,841 branches:u
> 63,693 branch-misses:u # 0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 1*1000*1000);
> NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> ---------+------------------+---------------------+------------+---------------+--------------+-----------------
> 1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 0
>
> 1,470,142,569 branches:u
> 15,023,983 branch-misses:u # 1.02% of all branches
>
> 0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 14893056 | 179937720 | 173 | 0 | 907 | 0
>
> 1,684,114,926 branches:u
> 1,989,901 branch-misses:u # 0.12% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 0
>
> 1,684,115,844 branches:u
> 34,215,740 branch-misses:u # 2.03% of all branches
>
> 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 14893056 | 179937720 | 176 | 0 | 867 | 0
>
> 1,469,540,357 branches:u
> 96,678 branch-misses:u # 0.01% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 0
>
> 1,469,540,533 branches:u
> 15,019,975 branch-misses:u # 1.02% of all branches
>
> 0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 20381696 | 179937720 | 171 | 0 | 717 | 0
>
> 1,349,614,294 branches:u
> 1,313 branch-misses:u # 0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
> NOTICE: sleeping for 2 seconds...
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
> --------+------------------+---------------------+------------+---------------+--------------+-----------------
> 999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 0
>
> 1,349,614,741 branches:u
> 30,592 branch-misses:u # 0.00% of all branches
>
> Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. The abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.

Right. When updating the patch from v4 to v5, I eliminated the duplication of code between the node types as much as possible, which in turn produced more code on the machine level. The results of your experiment clearly showed the bad side of this work.

FWIW I've also confirmed your changes in my environment (I've added a third argument to turn the randomized block selection proposed in the 0004 patch on and off):

* w/o patches

postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 | 10199040 | 180000000 | 87 | | 462 |
(1 row)

1590104944 branches:u # 3.430 G/sec
65957 branch-misses:u # 0.00% of all branches

postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 | 14893056 | 179937720 | 91 | | 497 |
(1 row)

1748249456 branches:u # 3.506 G/sec
481074 branch-misses:u # 0.03% of all branches

postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 | 10199040 | 180000000 | 86 | | 1290 |
(1 row)

1590105370 branches:u # 1.231 G/sec
15039443 branch-misses:u # 0.95% of all branches

Time: 4166.346 ms (00:04.166)

postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 | 14893056 | 179937720 | 90 | | 1536 |
(1 row)

1748250497 branches:u # 1.137 G/sec
28125016 branch-misses:u # 1.61% of all branches

* w/ all patches

postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 | 10199040 | 180000000 | 81 | | 432 |
(1 row)

1380062209 branches:u # 3.185 G/sec
1066 branch-misses:u # 0.00% of all branches

postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 | 20381696 | 179937720 | 88 | | 438 |
(1 row)

1379640815 branches:u # 3.133 G/sec
1332 branch-misses:u # 0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 | 10199040 | 180000000 | 81 | | 994 |
(1 row)

1380062386 branches:u # 1.386 G/sec
18368 branch-misses:u # 0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 | 20381696 | 179937720 | 88 | | 1098 |
(1 row)

1379641503 branches:u # 1.254 G/sec
18973 branch-misses:u # 0.00% of all branches

> I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
>
> Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with.
It returns an index, but only for linear nodes. Lookup nodesget a return value of zero. There is not enough commonality here. Agreed. > > Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaveswith the same api: > > rt_node_iterate_next > chunk_array_node_get_slot > node_128/256_get_slot > rt_node_search > > I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for thelast one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leafstuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisitefor decent performance as well as readability. Agreed. > > For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspectsI would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretendto know the practical consequences of every change I mention. > > - If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If thathas not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractionsin the current patch. > - As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally alwaysknow what kind we are if we found out earlier. > - For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the sametime, trying to treat them the same is not always worthwhile. > - Start to separate treatment of inner/leaves and see how it goes. Since I've not started coding the shared memory case seriously, I'm going to start with eliminating abstractions and splitting the treatment of inner and leaf nodes. > - I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willingto put serious time into that once the broad details are right. I will also investigate pointer tagging if we canconfirm that can work similarly for dsa pointers. I'll keep 4 node kinds. And we can later try to introduce classes into each node kind. > > Regarding size class decoupling, I'll respond to a point made earlier: > > On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > With this idea, we can just repalloc() to grow to the larger size in a > > pair but I'm slightly concerned that the more size class we use, the > > more frequent the node needs to grow. > > Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD nodekind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many waysa size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It'sa trade off between memory usage and complexity. Agreed. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Oct 14, 2022 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > On Mon, Oct 10, 2022 at 2:16 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I haveto start somewhere... > > > > My main concerns are that internal APIs: > > > > 1. are difficult to follow > > 2. lead to poor branch prediction and too many function calls > > > > Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regressionthere can go completely unnoticed. Hopefully the broader themes are informative. > > > > On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > [fixed benchmarks] > > > > Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top ofv6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary.I've done some testing on loading, but will leave it out for now in the interest of length. > > > > > > 0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off, forexample, because we know it already. And the sleep call is so I can run perf in a different shell session, on only thesearch portion. > > > > Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there arealways 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wiseloop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor: > > > > john=# select * from bench_seq_search(0, 1*1000*1000); > > NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > > 1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 0 > > > > 1,470,141,841 branches:u > > 63,693 branch-misses:u # 0.00% of all branches > > > > john=# select * from bench_shuffle_search(0, 1*1000*1000); > > NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > > 1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 0 > > > > 1,470,142,569 branches:u > > 15,023,983 branch-misses:u # 1.02% of all branches > > > > > > 0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the samepath in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go throughdifferent branches. The shuffle case is most affected, but even the sequential case slows down. 
(The leaves are lessfull -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least) > > > > john=# select * from bench_seq_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 14893056 | 179937720 | 173 | 0 | 907 | 0 > > > > 1,684,114,926 branches:u > > 1,989,901 branch-misses:u # 0.12% of all branches > > > > john=# select * from bench_shuffle_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 0 > > > > 1,684,115,844 branches:u > > 34,215,740 branch-misses:u # 2.03% of all branches > > > > > > 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictableperformance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way tooslow for node4, this benchmark hardly has any so it's ok. > > > > john=# select * from bench_seq_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 14893056 | 179937720 | 176 | 0 | 867 | 0 > > > > 1,469,540,357 branches:u > > 96,678 branch-misses:u # 0.01% of all branches > > > > john=# select * from bench_shuffle_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 0 > > > > 1,469,540,533 branches:u > > 15,019,975 branch-misses:u # 1.02% of all branches > > > > > > 0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler tocode. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly.With these patches, searching an unevenly populated load is the same or faster than the original sequential load,despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win thememory back.) > > > > john=# select * from bench_seq_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... 
> > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 20381696 | 179937720 | 171 | 0 | 717 | 0 > > > > 1,349,614,294 branches:u > > 1,313 branch-misses:u # 0.00% of all branches > > > > john=# select * from bench_shuffle_search(0, 2*1000*1000); > > NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 > > NOTICE: sleeping for 2 seconds... > > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms > > --------+------------------+---------------------+------------+---------------+--------------+----------------- > > 999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 0 > > > > 1,349,614,741 branches:u > > 30,592 branch-misses:u # 0.00% of all branches > > > > Expanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. Therabstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done thisway, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level. > > Right. When updating the patch from v4 to v5, I've eliminated the > duplication of code between each node type as much as possible, which > in turn produced more code on the machine level. The resulst of your > experiment clearly showed the bad side of this work. FWIW I've also > confirmed your changes in my environment (I've added the third > argument to turn on and off the randomizes block selection proposed in > 0004 patch): > > * w/o patches > postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false); > 2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height > = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > 1000000 | 10199040 | 180000000 | 87 | > | 462 | > (1 row) > > 1590104944 branches:u # 3.430 G/sec > 65957 branch-misses:u # 0.00% of all branches > > postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true); > 2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height = > 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > --------+------------------+---------------------+------------+---------------+--------------+----------------- > 999654 | 14893056 | 179937720 | 91 | > | 497 | > (1 row) > > 1748249456 branches:u # 3.506 G/sec > 481074 branch-misses:u # 0.03% of all branches > > postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 * > 1000, false); > 2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height > = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 > NOTICE: sleeping for 2 seconds... 
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > 1000000 | 10199040 | 180000000 | 86 | > | 1290 | > (1 row) > > 1590105370 branches:u # 1.231 G/sec > 15039443 branch-misses:u # 0.95% of all branches > > Time: 4166.346 ms (00:04.166) > postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 * > 1000, true); > 2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height = > 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > --------+------------------+---------------------+------------+---------------+--------------+----------------- > 999654 | 14893056 | 179937720 | 90 | > | 1536 | > (1 row) > > 1748250497 branches:u # 1.137 G/sec > 28125016 branch-misses:u # 1.61% of all branches > > * w/ all patches > postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false); > 2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height > = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > 1000000 | 10199040 | 180000000 | 81 | > | 432 | > (1 row) > > 1380062209 branches:u # 3.185 G/sec > 1066 branch-misses:u # 0.00% of all branches > > postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true); > 2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height = > 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > --------+------------------+---------------------+------------+---------------+--------------+----------------- > 999654 | 20381696 | 179937720 | 88 | > | 438 | > (1 row) > > 1379640815 branches:u # 3.133 G/sec > 1332 branch-misses:u # 0.00% of all branches > > postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 * > 1000, false); > 2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height > = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122 > NOTICE: sleeping for 2 seconds... > nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > ---------+------------------+---------------------+------------+---------------+--------------+----------------- > 1000000 | 10199040 | 180000000 | 81 | > | 994 | > (1 row) > > 1380062386 branches:u # 1.386 G/sec > 18368 branch-misses:u # 0.00% of all branches > > postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * > 1000, true); > 2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height = > 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 > NOTICE: sleeping for 2 seconds... 
> nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | > array_load_ms | rt_search_ms | array_serach_ms > --------+------------------+---------------------+------------+---------------+--------------+----------------- > 999654 | 20381696 | 179937720 | 88 | > | 1098 | > (1 row) > > 1379641503 branches:u # 1.254 G/sec > 18973 branch-misses:u # 0.00% of all branches > > > I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert()branches based on the kind. If it must call rt_node_grow(), that function has no idea where it camefrom and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branchagain. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a5-way jump table because the caller could be anything at all. > > > > Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compilerget rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert()is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodesget a return value of zero. There is not enough commonality here. > > Agreed. > > > > > Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaveswith the same api: > > > > rt_node_iterate_next > > chunk_array_node_get_slot > > node_128/256_get_slot > > rt_node_search > > > > I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful forthe last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leafstuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisitefor decent performance as well as readability. > > Agreed. > > > > > For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspectsI would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretendto know the practical consequences of every change I mention. > > > > - If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If thathas not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractionsin the current patch. > > - As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally alwaysknow what kind we are if we found out earlier. > > - For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the sametime, trying to treat them the same is not always worthwhile. > > - Start to separate treatment of inner/leaves and see how it goes. > > Since I've not started coding the shared memory case seriously, I'm > going to start with eliminating abstractions and splitting the > treatment of inner and leaf nodes. I've attached updated PoC patches for discussion and cfbot. From the previous version, I mainly changed the following things: * Separate treatment of inner and leaf nodes * Pack both the node kind and node count to an uint16 value. 
I've also made a change in functions in bench_radix_tree test module: the third argument of bench_seq/shuffle_search() is a flag to turn on and off the randomizes block selection. The results of performance tests in my environment are: postgres(1:1665989)=# select * from bench_seq_search(0, 1* 1000 * 1000, false); 2022-10-24 14:29:40.705 JST [1665989] LOG: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms ---------+------------------+---------------------+------------+---------------+--------------+----------------- 1000000 | 9871104 | 180000000 | 65 | | 248 | (1 row) postgres(1:1665989)=# select * from bench_seq_search(0, 2* 1000 * 1000, true); 2022-10-24 14:29:47.999 JST [1665989] LOG: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms --------+------------------+---------------------+------------+---------------+--------------+----------------- 999654 | 19680736 | 179937720 | 71 | | 237 | (1 row) postgres(1:1665989)=# select * from bench_shuffle_search(0, 1 * 1000 * 1000, false); 2022-10-24 14:29:55.955 JST [1665989] LOG: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms ---------+------------------+---------------------+------------+---------------+--------------+----------------- 1000000 | 9871104 | 180000000 | 65 | | 641 | (1 row) postgres(1:1665989)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true); 2022-10-24 14:30:04.140 JST [1665989] LOG: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms --------+------------------+---------------------+------------+---------------+--------------+----------------- 999654 | 19680736 | 179937720 | 71 | | 654 | (1 row) I've not done SIMD part seriously yet. But overall the performance seems good so far. If we agree with the current approach, I think we can proceed with the verification of decoupling node sizes from node kind. And I'll investigate DSA support. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes
Overall, this looks much better!
> * Pack both the node kind and node count to an uint16 value.
For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:
node4: 4 + 4 + 4*8 = 40 bytes
node4: 5 + 4 + (7) + 4*8 = 48 bytes
Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).
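To make that concrete, here is one way the arithmetic above falls out (a rough sketch with invented field names, assuming 8-byte pointers and the usual alignment rules):

typedef struct rt_node4_packed
{
	uint16		kind_count;		/* kind and count packed into 16 bits */
	uint8		chunks[4];		/* 6 bytes used, 2 bytes of padding */
	rt_node    *children[4];	/* 8-byte aligned */
} rt_node4_packed;				/* sizeof() == 40 */

typedef struct rt_node4_unpacked
{
	uint32		count;
	uint8		kind;
	uint8		chunks[4];		/* 9 bytes used, 7 bytes of padding */
	rt_node    *children[4];
} rt_node4_unpacked;			/* sizeof() == 48 */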
(Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).
> I've not done SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.
Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.
-------------------------
0001:
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
Leftovers, causing compiler warnings. (Also see new variable shadow warning)
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.
0002:
+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
Hmm?
+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
The caller must now have logic for inserting at the end:
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().
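As a sketch of that rename (byte-wise version only; the type and macro names come from the quoted code, the rest is assumed):

static inline int
node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
{
	int			count = NODE_GET_COUNT(node);

	/* index of the first element >= chunk, or count if there is none */
	for (int i = 0; i < count; i++)
	{
		if (node->chunks[i] >= chunk)
			return i;
	}

	return count;				/* insert at the tail, no -1 special case */
}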
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):
if (unlikely( ! has-free-slot))
grow-node;
else
{
...;
break;
}
/* FALLTHROUGH */
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search the new child and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but it's something to keep in mind as we proceed; perhaps add a comment about it as a reminder.
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.
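Roughly like this, reusing the names from the quoted fragment (sketch only, not tested):

	/* Delete the key from the leaf; if it wasn't there, we are done */
	Assert(NODE_IS_LEAF(stack[level]));
	if (!rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL))
		return false;

	/* Update the statistics */
	tree->num_keys--;

	/*
	 * If the leaf became empty, free it and delete the key from its parent,
	 * continuing upward while inner nodes keep becoming empty.
	 */
	while (level >= 0 && NODE_IS_EMPTY(stack[level]))
	{
		rt_free_node(tree, stack[level]);

		if (--level >= 0)
			rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
	}

	return true;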
+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Spelling
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at the top:
if (count > 4)
pg_unreachable();
This would have to change when we implement shrinking of nodes, but might still be useful.
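One way to avoid hard-coding the bound is to pass the fanout from the (inlined) caller, where it is a compile-time constant; a sketch, with the extra parameter being an assumption of mine:

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children,
						  int count, int fanout)
{
	/*
	 * Tell the compiler 'count' is bounded by the node's fanout.  This only
	 * helps if the caller passes a constant and the function is inlined, so
	 * the bound is visible at the memcpy sites.
	 */
	if (count > fanout)
		pg_unreachable();

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}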
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
Maybe just "return rt_node_search_leaf(...)" ?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Oct 26, 2022 at 8:06 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached updated PoC patches for discussion and cfbot. From the > > previous version, I mainly changed the following things: > > Thank you for the comments! > > * Separate treatment of inner and leaf nodes > > Overall, this looks much better! > > > * Pack both the node kind and node count to an uint16 value. > > For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking againat the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum isa good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like thisexample: > > node4: 4 + 4 + 4*8 = 40 > node4: 5 + 4+(7) + 4*8 = 48 bytes > > Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes fromkind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded ina pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be6 temporarily if I work on size decoupling first). True. I'm going to start with 6 bytes and will consider reducing it to 5 bytes. Encoding the kind in a pointer tag could be tricky given DSA support so currently I'm thinking to pack the node kind and node capacity classes to uint8. > > (Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need towrite your own shifting/masking code). Thanks! > > > I've not done SIMD part seriously yet. But overall the performance > > seems good so far. If we agree with the current approach, I think we > > can proceed with the verification of decoupling node sizes from node > > kind. And I'll investigate DSA support. > > Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently withthe above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still besome performance tuning and cosmetic work, but it's getting closer. > I've made some progress on investigating DSA support. I've written draft patch for that and regression tests passed. I'll share it as a separate patch for discussion with v8 radix tree patch. While implementing DSA support, I realized that we may not need to use pointer tagging to distinguish between backend-local address or dsa_pointer. In order to get a backend-local address from dsa_pointer, we need to pass dsa_area like: node = dsa_get_address(tree->dsa, node_dp); As shown above, the dsa area used by the shared radix tree is stored in radix_tree struct, so we can know whether the radix tree is shared or not by checking (tree->dsa == NULL). That is, if it's shared we use a pointer to radix tree node as dsa_pointer, and if not we use a pointer as a backend-local pointer. We don't need to encode something in a pointer. > ------------------------- > 0001: > > +#ifndef USE_NO_SIMD > +#include "port/pg_bitutils.h" > +#endif > > Leftover from an earlier version? > > +static inline int vector8_find(const Vector8 v, const uint8 c); > +static inline int vector8_find_ge(const Vector8 v, const uint8 c); > > Leftovers, causing compiler warnings. (Also see new variable shadow warning) Will fix. 
> > +#else /* USE_NO_SIMD */ > + Vector8 r = 0; > + uint8 *rp = (uint8 *) &r; > + > + for (Size i = 0; i < sizeof(Vector8); i++) > + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]); > + > + return r; > +#endif > > As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writingtheir own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaroundfor lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiomfor non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm. Agreed. Will remove non-SIMD code. > > 0002: > > + /* XXX: should not to use vector8_highbit_mask */ > + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8)); > > Hmm? It's my outdated memo, will remove. > > +/* > + * Return index of the first element in chunks in the given node that is greater > + * than or equal to 'key'. Return -1 if there is no such element. > + */ > +static inline int > +node_32_search_ge(rt_node_base_32 *node, uint8 chunk) > > The caller must now have logic for inserting at the end: > > + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk); > + int16 count = NODE_GET_COUNT(n32); > + > + if (insertpos < 0) > + insertpos = count; /* insert to the tail */ > > It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example).In fact, these functions are probably better named node*_get_insertpos(). Agreed. > > + if (likely(NODE_HAS_FREE_SLOT(n128))) > + { > + node_inner_128_insert(n128, chunk, child); > + break; > + } > + > + /* grow node from 128 to 256 */ > > We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together.This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desiredeffect, so we need to write like this (also see prototype): > > if (unlikely( ! has-free-slot)) > grow-node; > else > { > ...; > break; > } > /* FALLTHROUGH */ Good point. Will change. > > + /* Descend the tree until a leaf node */ > + while (shift >= 0) > + { > + rt_node *child; > + > + if (NODE_IS_LEAF(node)) > + break; > + > + if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child)) > + child = rt_node_add_new_child(tree, parent, node, key); > + > + Assert(child); > + > + parent = node; > + node = child; > + shift -= RT_NODE_SPAN; > + } > > Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there(the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mindas we proceed. Maybe a comment about it to remind us. Agreed. Currently rt_extend() is used to add upper nodes but probably we need another function to add lower nodes for this case. > > + /* there is no key to delete */ > + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL)) > + return false; > + > + /* Update the statistics */ > + tree->num_keys--; > + > + /* > + * Delete the key from the leaf node and recursively delete the key in > + * inner nodes if necessary. 
> + */ > + Assert(NODE_IS_LEAF(stack[level])); > + while (level >= 0) > + { > + rt_node *node = stack[level--]; > + > + if (NODE_IS_LEAF(node)) > + rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL); > + else > + rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL); > + > + /* If the node didn't become empty, we stop deleting the key */ > + if (!NODE_IS_EMPTY(node)) > + break; > + > + /* The node became empty */ > + rt_free_node(tree, node); > + } > > Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls areinlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeppingand loop over the inner nodes. This might require some duplication of code. Agreed. > > +ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child) > > Spelling WIll fix. > > +static inline void > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children, > + uint8 *dst_chunks, rt_node **dst_children, int count) > +{ > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count); > + memcpy(dst_children, src_children, sizeof(rt_node *) * count); > +} > > gcc generates better code with something like this (but not hard-coded) at the top: > > if (count > 4) > pg_unreachable(); Agreed. > > This would have to change when we implement shrinking of nodes, but might still be useful. > > + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p)) > + return false; > + > + return true; > > Maybe just "return rt_node_search_leaf(...)" ? Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.
Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.
> Encoding the kind in a pointer tag could be tricky given DSA
If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.
> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.
That won't work: if we need 128 for capacity, there are no bits left for the kind. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.
> I've made some progress on investigating DSA support. I've written
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with v8 radix tree patch.
Great!
> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:
I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.
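For illustration, that scheme might reduce to a helper along these lines (the union layout and the rt_node_ptr / rt_node_ptr_get names are assumptions made for this sketch; only dsa_get_address() and the tree->dsa check come from the description above):

/* either a backend-local pointer or a dsa_pointer, depending on the tree */
typedef union rt_node_ptr
{
	rt_node    *local;
	dsa_pointer dp;
} rt_node_ptr;

static inline rt_node *
rt_node_ptr_get(radix_tree *tree, rt_node_ptr ptr)
{
	/* a non-shared tree stores plain pointers, so no tag bits are needed */
	if (tree->dsa == NULL)
		return ptr.local;

	return (rt_node *) dsa_get_address(tree->dsa, ptr.dp);
}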
Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.
> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > + uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded) at the top:
> >
> > if (count > 4)
> > pg_unreachable();
Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to
static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 12:21 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > True. I'm going to start with 6 bytes and will consider reducing it to > > 5 bytes. > > Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decouplingafter v8 and see how it goes. > > > Encoding the kind in a pointer tag could be tricky given DSA > > If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hopeto at least test it out with local memory. > > > support so currently I'm thinking to pack the node kind and node > > capacity classes to uint8. > > That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directlycompare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my lastmessage, we need to access the kind quickly, without more cycles. Understood. > > > I've made some progress on investigating DSA support. I've written > > draft patch for that and regression tests passed. I'll share it as a > > separate patch for discussion with v8 radix tree patch. > > Great! > > > While implementing DSA support, I realized that we may not need to use > > pointer tagging to distinguish between backend-local address or > > dsa_pointer. In order to get a backend-local address from dsa_pointer, > > we need to pass dsa_area like: > > I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know theplaces that would be affected by tagging the pointer with the node kind. > > Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in thebackend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too.However, it's okay to keep the benchmarking module in autoconf, since it won't be committed. Updated to support Meson. > > > > +static inline void > > > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children, > > > + uint8 *dst_chunks, rt_node **dst_children, int count) > > > +{ > > > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count); > > > + memcpy(dst_children, src_children, sizeof(rt_node *) * count); > > > +} > > > > > > gcc generates better code with something like this (but not hard-coded) at the top: > > > > > > if (count > 4) > > > pg_unreachable(); > > Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, sowhy doesn't the compiler know that? I believe it boils down to > > static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = { > > In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant.This might not be important just yet, because I want to base the check on the proposed node capacity instead, butI mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants. I've attached v8 patches. 0001, 0002, and 0003 patches incorporated the comments I got so far. 0004 patch is a DSA support patch for PoC. 
In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes to point to its children, and we use rt_node_ptr as either rt_node* or dsa_pointer depending on whether the radix tree is shared or not (ie, by checking radix_tree->dsa == NULL). Regarding the performance, I've added another boolean argument to bench_seq/shuffle_search(), specifying whether to use the shared radix tree or not. Here are benchmark results in my environment:

select * from bench_seq_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2 * 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1 * 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2 * 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)

In non-shared radix tree cases (the fourth argument is false), I don't see a visible performance degradation. On the other hand, in shared radix tree cases (the fourth argument is true), I see visible overheads because of dsa_get_address(). Please note that the current shared radix tree implementation doesn't support any locking, so it cannot be read while written by someone.
Also, only one process can iterate over the shared radix tree. When it comes to parallel vacuum, these don't become a restriction, as the leader process writes the radix tree while scanning the heap and the radix tree is read by multiple processes while vacuuming indexes. And only the leader process does heap vacuum, by iterating over the key-value pairs in the radix tree. If we want to use it for other cases too, we would need to support locking, RCU, or something similar. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> the comments I got so far. 0004 patch is a DSA support patch for PoC.
Thanks for the new patchset. This is not a full review, but I have some comments:
0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:
> In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> to point its children, and we use rt_node_ptr as either rt_node* or
> dsa_pointer depending on whether the radix tree is shared or not (ie,
> by checking radix_tree->dsa == NULL).
0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with:
typedef struct rt_node_ptr
{
uintptr_t encoded;
rt_node * decoded;
}
Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.
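For illustration, a minimal sketch of what I mean, assuming local memory only, so each helper just copies/casts between the two members (the bodies here are only an assumption of how it could look):

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
    /* local-memory sketch: the encoded form is just the raw pointer value */
    ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
    /* inverse of the above; later, this is where a tag would be masked off */
    ptr->decoded = (rt_node *) ptr->encoded;
}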
Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA?, or does this ideally belong in 0002? What's the reason for it?
> Regarding the performance, I've
> added another boolean argument to bench_seq/shuffle_search(),
> specifying whether to use the shared radix tree or not. Here are
> benchmark results in my environment,
> [...]
> In non-shared radix tree cases (the fourth argument is false), I don't
> see a visible performance degradation. On the other hand, in shared
> radix tree cases (the fourth argument is true), I see visible overheads
> because of dsa_get_address().
Thanks, this is useful.
> Please note that the current shared radix tree implementation doesn't
> support any locking, so it cannot be read while written by someone.
I think at the very least we need a global lock to enforce this.
> Also, only one process can iterate over the shared radix tree. When it
> comes to parallel vacuum, these don't become restriction as the leader
> process writes the radix tree while scanning heap and the radix tree
> is read by multiple processes while vacuuming indexes. And only the
> leader process can do heap vacuum by iterating the key-value pairs in
> the radix tree. If we want to use it for other cases too, we would
> need to support locking, RCU or something.
A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 3, 2022 at 1:59 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated > > the comments I got so far. 0004 patch is a DSA support patch for PoC. > > Thanks for the new patchset. This is not a full review, but I have some comments: > > 0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following: > > > In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes > > to point its children, and we use rt_node_ptr as either rt_node* or > > dsa_pointer depending on whether the radix tree is shared or not (ie, > > by checking radix_tree->dsa == NULL). > Thank you for the comments! > 0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read: > > - if (found && child_p) > - *child_p = child; > + if (found && childp_p) > + *childp_p = childp; > ... > rt_node_inner_32 *new32; > + rt_node_ptr new32p; > > /* grow node from 4 to 32 */ > - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4, > - RT_NODE_KIND_32); > + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32); > + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p); > > It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with: > > typedef struct rt_node_ptr > { > uintptr_t encoded; > rt_node * decoded; > } > > Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier. Good idea. Will try in the next version patch. > > Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA?, or does this ideally belong in 0002? What's the reason for it? Oh, this was needed at one point when I was initially writing the DSA support, but thinking about it again now, I think we can remove it and use rt_node_insert_inner() with parent = NULL instead. > > Regarding the performance, I've > > added another boolean argument to bench_seq/shuffle_search(), > > specifying whether to use the shared radix tree or not. Here are > > benchmark results in my environment, > > > [...] > > > In non-shared radix tree cases (the fourth argument is false), I don't > > see a visible performance degradation. On the other hand, in shared > > radix tree cases (the fourth argument is true), I see visible overheads > > because of dsa_get_address(). > > Thanks, this is useful. > > > Please note that the current shared radix tree implementation doesn't > > support any locking, so it cannot be read while written by someone. > > I think at the very least we need a global lock to enforce this. > > > Also, only one process can iterate over the shared radix tree.
When it > > comes to parallel vacuum, these don't become restriction as the leader > > process writes the radix tree while scanning heap and the radix tree > > is read by multiple processes while vacuuming indexes. And only the > > leader process can do heap vacuum by iterating the key-value pairs in > > the radix tree. If we want to use it for other cases too, we would > > need to support locking, RCU or something. > > A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be. For parallel heap pruning, multiple workers will insert key-value pairs to the radix tree concurrently. The simplest solution would be a single lock to protect writes but the performance will not be good. Another solution would be that we can divide the tables into multiple ranges so that keys derived from TIDs are not conflicted with each other and have parallel workers process one or more ranges. That way, parallel vacuum workers can build *sub-trees* and the leader process can merge them. In use cases of lazy vacuum, since the write phase and read phase are separated the readers don't need to worry about concurrent updates. I've attached a draft patch for lazy vacuum integration that can be applied on top of v8 patches. The patch adds a new module called TIDStore, an efficient storage for TID backed by radix tree. Lazy vacuum and parallel vacuum use it instead of a TID array. The patch also introduces rt_detach() that was missed in 0002 patch. It's a very rough patch but I hope it helps in considering lazy vacuum integration, radix tree APIs, and shared radix tree functionality. There are some TODOs:

* We need to reset the TIDStore and therefore reset the radix tree. It can easily be done by using MemoryContextReset() in non-shared radix tree cases, but in the shared case, we need either to free all radix tree nodes recursively or introduce a way to release all allocated DSA memory.
* We need to limit the size of TIDStore (mainly radix_tree) to maintenance_work_mem.
* We need to change the counter-based information in pg_stat_progress_vacuum such as max_dead_tuples and num_dead_tuples. I think it would be better to show the maximum number of bytes we can use to collect TIDs, and its current usage, instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
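(As a sketch of what such a module's interface could look like -- illustrative only, the exact names and signatures in the attached patch may differ:)

typedef struct TidStore TidStore;

/* create a store; dsa == NULL keeps it in backend-local memory */
extern TidStore *tidstore_create(size_t max_bytes, dsa_area *dsa);

/* collect the dead TIDs of one heap block during the heap scan */
extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno,
                              OffsetNumber *offsets, int num_offsets);

/* the lazy_tid_reaped() check used during index vacuuming */
extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);

/* per-block iteration for the heap vacuum phase */
extern void tidstore_begin_iterate(TidStore *ts);
extern bool tidstore_iterate_next(TidStore *ts, BlockNumber *blkno,
                                  OffsetNumber *offsets, int *num_offsets);

/* for enforcing the maintenance_work_mem limit */
extern size_t tidstore_memory_usage(TidStore *ts);

extern void tidstore_reset(TidStore *ts);
extern void tidstore_destroy(TidStore *ts);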
Attachment
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> For parallel heap pruning, multiple workers will insert key-value
> pairs to the radix tree concurrently. The simplest solution would be a
> single lock to protect writes but the performance will not be good.
> Another solution would be that we can divide the tables into multiple
> ranges so that keys derived from TIDs are not conflicted with each
> other and have parallel workers process one or more ranges. That way,
> parallel vacuum workers can build *sub-trees* and the leader process
> can merge them. In use cases of lazy vacuum, since the write phase and
> read phase are separated the readers don't need to worry about
> concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
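A sketch of the per-worker loop under that model, where every function and field name is a placeholder for illustration rather than an existing API:

    BlockNumber startblk,
                endblk,
                blk;
    int         ndead;
    ItemPointerData *deadtids;

    /* local buffer big enough for one 64-page range, reused across ranges */
    deadtids = palloc(sizeof(ItemPointerData) * MaxHeapTuplesPerPage * 64);

    while (leader_get_next_range(shared, &startblk, &endblk))
    {
        ndead = 0;
        for (blk = startblk; blk < endblk; blk++)
            ndead += prune_one_block(rel, blk, &deadtids[ndead]);

        /* one short critical section per 64-page range */
        LWLockAcquire(&shared->tidstore_lock, LW_EXCLUSIVE);
        tidstore_enter_tids(shared->tidstore, deadtids, ndead);
        LWLockRelease(&shared->tidstore_lock);
    }
    pfree(deadtids);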
> I've attached a draft patch for lazy vacuum integration that can be
> applied on top of v8 patches. The patch adds a new module called
> TIDStore, an efficient storage for TID backed by radix tree. Lazy
> vacuum and parallel vacuum use it instead of a TID array. The patch
> also introduces rt_detach() that was missed in 0002 patch. It's a very
> rough patch but I hope it helps in considering lazy vacuum
> integration, radix tree APIs, and shared radix tree functionality.
It does help, good to see this.
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > For parallel heap pruning, multiple workers will insert key-value > > pairs to the radix tree concurrently. The simplest solution would be a > > single lock to protect writes but the performance will not be good. > > Another solution would be that we can divide the tables into multiple > > ranges so that keys derived from TIDs are not conflicted with each > > other and have parallel workers process one or more ranges. That way, > > parallel vacuum workers can build *sub-trees* and the leader process > > can merge them. In use cases of lazy vacuum, since the write phase and > > read phase are separated the readers don't need to worry about > > concurrent updates. > > It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end. Seems a promising idea. I think it might work well even in the current parallel vacuum (ie., single writer). I mean, I think we can have a single lwlock for shared cases in the first version. If the overhead of acquiring the lwlock per insertion of key-value is not negligible, we might want to try this idea. Apart from that, I'm going to incorporate the comments on 0004 patch and try a pointer tagging. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Nov 4, 2022 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > For parallel heap pruning, multiple workers will insert key-value > pairs to the radix tree concurrently. The simplest solution would be a > single lock to protect writes but the performance will not be good. > Another solution would be that we can divide the tables into multiple > ranges so that keys derived from TIDs are not conflicted with each > other and have parallel workers process one or more ranges. That way, > parallel vacuum workers can build *sub-trees* and the leader process > can merge them. In use cases of lazy vacuum, since the write phase and > read phase are separated the readers don't need to worry about > concurrent updates. I think that the VM snapshot concept can eventually be used to implement parallel heap pruning. Since every page that will become a scanned_pages is known right from the start with VM snapshots, it will be relatively straightforward to partition these pages into distinct ranges with an equal number of pages, one per worker planned. The VM snapshot structure can also be used for I/O prefetching, which will be more important with parallel heap pruning (and with aio). Working off of an immutable structure that describes which pages to process right from the start is naturally easy to work with, in general. We can "reorder work" flexibly (i.e. process individual scanned_pages in any order that is convenient). Another example is "changing our mind" about advancing relfrozenxid when it turns out that we maybe should have decided to do that at the start of VACUUM [1]. Maybe the specific "changing our mind" idea will turn out to not be a very useful idea, but it is at least an interesting and thought provoking concept. [1] https://postgr.es/m/CAH2-WzkQ86yf==mgAF=cQ0qeLRWKX3htLw9Qo+qx3zbwJJkPiQ@mail.gmail.com -- Peter Geoghegan
On Tue, Nov 8, 2022 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > For parallel heap pruning, multiple workers will insert key-value > > > pairs to the radix tree concurrently. The simplest solution would be a > > > single lock to protect writes but the performance will not be good. > > > Another solution would be that we can divide the tables into multiple > > > ranges so that keys derived from TIDs are not conflicted with each > > > other and have parallel workers process one or more ranges. That way, > > > parallel vacuum workers can build *sub-trees* and the leader process > > > can merge them. In use cases of lazy vacuum, since the write phase and > > > read phase are separated the readers don't need to worry about > > > concurrent updates. > > > > It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end. > > Seems a promising idea. I think it might work well even in the current > parallel vacuum (ie., single writer). I mean, I think we can have a > single lwlock for shared cases in the first version. If the overhead > of acquiring the lwlock per insertion of key-value is not negligible, > we might want to try this idea. > > Apart from that, I'm going to incorporate the comments on 0004 patch > and try a pointer tagging. I'd like to share some progress on this work. 0004 patch is a new patch supporting a pointer tagging of the node kind. Also, it introduces rt_node_ptr we discussed so that internal functions use it rather than having two arguments for encoded and decoded pointers. With this intermediate patch, the DSA support patch became more readable and understandable. Probably we can make it smaller further if we move the change of separating the control object from radix_tree to the main patch (0002). The patch still needs to be polished but I'd like to check if this idea is worthwhile. If we agree on this direction, this patch will be merged into the main radix tree implementation patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> 0004 patch is a new patch supporting a pointer tagging of the node
> kind. Also, it introduces rt_node_ptr we discussed so that internal
> functions use it rather than having two arguments for encoded and
> decoded pointers. With this intermediate patch, the DSA support patch
> became more readable and understandable. Probably we can make it
> smaller further if we move the change of separating the control object
> from radix_tree to the main patch (0002). The patch still needs to be
> polished but I'd like to check if this idea is worthwhile. If we agree
> on this direction, this patch will be merged into the main radix tree
> implementation patch.
Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:
- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.
I'll set the patch to "waiting on author", but in this case the author is me.
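On the last point, a minimal sketch of the union idea for the local-memory template (illustrative only):

typedef union rt_node_ptr
{
    uintptr_t   encoded;        /* tagged form, if we tag local pointers later */
    rt_node    *decoded;        /* plain pointer used by the local-memory code */
} rt_node_ptr;

Since both members share the same storage, the node pointer stays a single word in the local case, and no copying between members is needed.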
On Mon, Nov 14, 2022 at 10:00 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > 0004 patch is a new patch supporting a pointer tagging of the node > > kind. Also, it introduces rt_node_ptr we discussed so that internal > > functions use it rather than having two arguments for encoded and > > decoded pointers. With this intermediate patch, the DSA support patch > > became more readable and understandable. Probably we can make it > > smaller further if we move the change of separating the control object > > from radix_tree to the main patch (0002). The patch still needs to be > > polished but I'd like to check if this idea is worthwhile. If we agree > > on this direction, this patch will be merged into the main radix tree > > implementation patch. > > Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier: > > - See how much performance we actually gain from tagging the node kind. > - Try additional size classes while keeping the node kinds to only four. > - Optimize node128 insert. > - Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated. Thanks! Please let me know if there is something I can help with. In the meanwhile, I'd like to make some progress on the vacuum integration and improving the test coverage. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 1:46 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thanks! Please let me know if there is something I can help with. > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 Which tests do you use to get this assertion failure? I've confirmed there is a bug in 0005 patch but without it, "make check-world" passed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 2:17 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Thanks! Please let me know if there is something I can help with. > > > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 > > Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate. Totally agreed. I'll separate them in the next version patch. Thank you for your advice. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
It's based on the random int load test, but tests search speed. Run like this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
Attachment
On Wed, Nov 16, 2022 at 4:39 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Thanks! Please let me know if there is something I can help with. > > > > > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > > > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 > > > > Which tests do you use to get this assertion failure? I've confirmed > > there is a bug in 0005 patch but without it, "make check-world" > > passed. > > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise. Good to know. No problem. > I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues. Thank you for testing! > > It's based on the random int load test, but tests search speed. Run like this: > > select * from bench_search_random_nodes(10 * 1000 * 1000) > > It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo: > > filter = ((uint64)1<<40)-1; > LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130 > > Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using > > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF) > > which gives > > LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024 > > Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example: > > filter = (((uint64) 1<<32) | (0xFF<<24)); > LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161 > > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that. In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000. It seems the filter should be (((uint64) 1<<32) | ((uint64) 0xFF<<24)). > > 2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled): > > v9 0003: 2062 2051 2050 > v9 0004: 2346 2316 2321 > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case. I'll also run the test on my environment and do the investigation tomorrow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 1:18 PM I wrote:
> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.
While the most important challenge right now is how to best represent and organize the shared memory case, I wanted to get the above idea working and out of the way, to be saved for a future time. I've attached a rough implementation (applies on top of v9 0003) that splits node32 into 2 size classes. They both share the exact same base data type and hence the same search/set code, so the number of "kind"s is still four, but here there are five "size classes", so a new case in the "unlikely" node-growing path. The smaller instance of node32 is a "node15", because that's currently 160 bytes, corresponding to one of the DSA size classes. This idea can be applied to any other node except the max size, as we see fit. (Adding a singleton size class would bring it back in line with the prototype, at least as far as memory consumption.)
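To make that concrete, here is a rough sketch of the shared base type; these are not the exact definitions from the attached patch, the flexible-array idea is the point. Both size classes use the same layout and the same search/set code, and only the allocated length of the children array (recorded in the common header as the fanout) differs:

typedef struct rt_node_base_32
{
    rt_node     n;              /* common header: kind, fanout, count, ... */
    uint8       chunks[32];     /* control data is identical for both classes */
} rt_node_base_32;

typedef struct rt_node_inner_32
{
    rt_node_base_32 base;
    /* allocated with 15 or 32 elements, depending on the size class */
    rt_node    *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;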
One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
In the course of working on this, I encountered a pain point. Since it's impossible to repalloc in slab, we have to do alloc/copy/free ourselves. That's fine, but the current coding makes too many assumptions about the use cases: rt_alloc_node and rt_copy_node are too entangled with each other and do too much work unrelated to what the names imply. I seem to remember an earlier version had something like rt_node_copy_common that did only...copying. That was much easier to reason about. In 0002 I resorted to doing my own allocation to show what I really want to do, because the new use case doesn't need zeroing and setting values. It only needs to...allocate (and increase the stats counter if built that way).
Future optimization work while I'm thinking of it: rt_alloc_node should be always-inlined and the memset done separately (i.e. not *AllocZero). That way the compiler should be able to generate more efficient zeroing code for smaller nodes. I'll test the numbers on this sometime in the future.
Attachment
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 4:39 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor > > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > > > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > Thanks! Please let me know if there is something I can help with. > > > > > > > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > > > > > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 > > > > > > Which tests do you use to get this assertion failure? I've confirmed > > > there is a bug in 0005 patch but without it, "make check-world" > > > passed. > > > > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise. > > Good to know. No problem. > > > I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues. > > Thank you for testing! > > > > > It's based on the random int load test, but tests search speed. Run like this: > > > > select * from bench_search_random_nodes(10 * 1000 * 1000) > > > > It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo: > > > > filter = ((uint64)1<<40)-1; > > LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130 > > > > Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using > > > > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF) > > > > which gives > > > > LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024 > > > > Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example: > > > > filter = (((uint64) 1<<32) | (0xFF<<24)); > > LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161 > > > > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that. > > In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000. > It seems the filter should be (((uint64) 1<<32) | ((uint64) > 0xFF<<24)). > > > > > 2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled): > > > > v9 0003: 2062 2051 2050 > > v9 0004: 2346 2316 2321 > > > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case. > > I'll also run the test on my environment and do the investigation tomorrow. > FYI I've not tested the patch you shared today but here are the benchmark results I did with the v9 patch in my environment (I used the second filter). I split the 0004 patch into two patches: a pure refactoring patch to introduce rt_node_ptr and a patch to do pointer tagging.
v9 0003 patch        : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging      : 1085 1087 1086 (equivalent to 0004 patch)

In my environment, rt_node_ptr seemed to add some overhead but pointer tagging had performance benefits. I'm not sure why the results are different from yours. The radix tree stats show the same as your tests.

=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch : 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.
There is less than 2% difference from the medial set of results, so it's hard to distinguish from noise. I did a fresh rebuild and retested with the same results: about 15% slowdown in v9 0004. That's strange.
On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).
Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a straightforward explanation.
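Spelled out, a two-line illustration of the trap, with the values observed above:

/*
 * 0xFF << 24 is computed as a signed 32-bit int and comes out negative, so
 * converting it to uint64 sign-extends it to 0xFFFFFFFFFF000000.  Forcing the
 * operand to uint64 first gives the intended 0x00000001FF000000.
 */
uint64      wrong = (((uint64) 1 << 32) | (0xFF << 24));
uint64      right = (((uint64) 1 << 32) | ((uint64) 0xFF << 24));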
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:
3,343,352 branch-misses:u # 0.85% of all branches
393,204,959 branches:u
Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.
> > I'll also run the test on my environment and do the investigation tomorrow.
> >
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
[1] https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).
node4:
NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16520 | 0 | 3
NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16456 | 0 | 17
NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16456 | 0 | 89
NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 16488 | 0 | 327
node32:
NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16488 | 0 | 5
(1 row)
NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16520 | 0 | 28
NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16408 | 0 | 79
NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 24616 | 0 | 199
In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.
Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
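To spell out that arithmetic, a flat layout along the following lines has no padding on a 64-bit build for three children; the field names and ordering here are invented for illustration and are not the patch's actual definitions:
/* assumes PostgreSQL's uint8/uint16 typedefs */
typedef struct hypothetical_node3
{
	uint16		count;			/* offset 0, 2 bytes */
	uint8		shift;			/* offset 2 */
	uint8		chunk;			/* offset 3 */
	uint8		kind;			/* offset 4 -- 5 header bytes in total */
	uint8		chunks[3];		/* offsets 5..7 fill out the first 8 bytes */
	void	   *children[3];	/* offset 8, already 8-byte aligned */
} hypothetical_node3;			/* sizeof() == 32, no padding */
typedef struct hypothetical_node4
{
	uint16		count;
	uint8		shift;
	uint8		chunk;
	uint8		kind;
	uint8		chunks[4];		/* offsets 5..8 */
	/* 7 bytes of padding here so children[] can start on an 8-byte boundary */
	void	   *children[4];	/* offset 16 */
} hypothetical_node4;			/* sizeof() == 48 */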
On Mon, Nov 21, 2022 at 3:43 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 4:39 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
>
> Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:
>
> 3,343,352 branch-misses:u # 0.85% of all branches
> 393,204,959 branches:u
>
> Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.
>
> > > I'll also run the test on my environment and do the investigation tomorrow.
> >
> > FYI I've not tested the patch you shared today but here are the benchmark results I did with the v9 patch in my environment (I used the second filter). I splitted 0004 patch into two patches: a patch for pure refactoring patch to introduce rt_node_ptr and a patch to do pointer tagging.
>
> Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
Sure. I've attached the v10 patches. 0004 is the pure refactoring patch and 0005 patch introduces the pointer tagging.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v10-0003-tool-for-measuring-radix-tree-performance.patch
- v10-0004-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
- v10-0005-PoC-tag-the-node-kind-to-rt_pointer.patch
- v10-0007-PoC-lazy-vacuum-integration.patch
- v10-0006-PoC-DSA-support-for-radix-tree.patch
- v10-0002-Add-radix-implementation.patch
- v10-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
On Mon, Nov 21, 2022 at 4:20 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 2 | 16 | 16520 | 0 | 3
>
> NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 3 | 81 | 16456 | 0 | 17
>
> NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 4 | 256 | 16456 | 0 | 89
>
> NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 5 | 625 | 16488 | 0 | 327
>
> node32:
>
> NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 2 | 16 | 16488 | 0 | 5
> (1 row)
>
> NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 3 | 81 | 16520 | 0 | 28
>
> NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 4 | 256 | 16408 | 0 | 79
>
> NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
> 5 | 625 | 24616 | 0 | 199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3 + 3*8 = 32
IIUC if we store the fanout member only in variable-sized nodes, rt_node has only count, shift, and chunk, so 4 bytes in total. If so, the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The size doesn't change but there is 1 byte padding space.
Also, even if we have the node3 a variable-sized node, size class 1 for node3 could be a good choice since it also doesn't need padding space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3 + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change but there is 1 byte padding space.
I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
> Also, even if we have the node3 a variable-sized node, size class 1
> for node3 could be a good choice since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3 : 5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes
Precisely! I have that scenario in my notes as well -- it's quite compelling.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> Sure. I've attached the v10 patches. 0004 is the pure refactoring
> patch and 0005 patch introduces the pointer tagging.
This failed on cfbot, with so many crashes that the VM ran out of disk for core dumps. During testing with 32bit, so there's probably something broken around that.
https://cirrus-ci.com/task/4635135954386944
A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
==55813==Using libbacktrace symbolizer.
#0 0x56dcc274 in rt_create ../src/backend/lib/radixtree.c:1696
#1 0x56953d1b in tidstore_create ../src/backend/access/common/tidstore.c:57
#2 0x56a1ca4f in dead_items_alloc ../src/backend/access/heap/vacuumlazy.c:3109
#3 0x56a2219f in heap_vacuum_rel ../src/backend/access/heap/vacuumlazy.c:539
#4 0x56cb77ed in table_relation_vacuum ../src/include/access/tableam.h:1681
#5 0x56cb77ed in vacuum_rel ../src/backend/commands/vacuum.c:2062
#6 0x56cb9a16 in vacuum ../src/backend/commands/vacuum.c:472
#7 0x56cba904 in ExecVacuum ../src/backend/commands/vacuum.c:272
#8 0x5711b6d0 in standard_ProcessUtility ../src/backend/tcop/utility.c:866
#9 0x5711bdeb in ProcessUtility ../src/backend/tcop/utility.c:530
#10 0x5711759f in PortalRunUtility ../src/backend/tcop/pquery.c:1158
#11 0x57117cb8 in PortalRunMulti ../src/backend/tcop/pquery.c:1315
#12 0x571183d2 in PortalRun ../src/backend/tcop/pquery.c:791
#13 0x57111049 in exec_simple_query ../src/backend/tcop/postgres.c:1238
#14 0x57113f9c in PostgresMain ../src/backend/tcop/postgres.c:4551
#15 0x5711463d in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4028
#16 0x56df4672 in main ../src/backend/main/main.c:197
#17 0xf6ad8e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
#18 0x5691d0f0 in _start (/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/postgres+0x3040f0)
Aborted (core dumped)
child process exited with exit code 134
initdb: data directory "/tmp/cirrus-ci-build/build-32/testrun/adminpack/regress/tmp_check/data" not removed at user's request
On Mon, Nov 21, 2022 at 6:30 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:20 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
> > >
> > > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > > node3: 5 + 3 + 3*8 = 32
> >
> > IIUC if we store the fanout member only in variable-sized nodes, rt_node has only count, shift, and chunk, so 4 bytes in total. If so, the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The size doesn't change but there is 1 byte padding space.
>
> I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
>
> > Also, even if we have the node3 a variable-sized node, size class 1 for node3 could be a good choice since it also doesn't need padding space and could be a good alternative to path compression.
> >
> > node3 : 5 + 3 + 3*8 = 32 bytes
> > size class 1 : 5 + 3 + 1*8 = 16 bytes
>
> Precisely! I have that scenario in my notes as well -- it's quite compelling.
So it seems that there are two candidates of rt_node structure: (1) all nodes except for node256 are variable-size nodes and use pointer tagging, and (2) node32 and node128 are variable-sized nodes and do not use pointer tagging (fanout is in part of only these two nodes). rt_node can be 5 bytes in both cases. But before going to this step, I started to verify the idea of variable-size nodes by using 6-bytes rt_node. We can adjust the node kinds and node classes later. In this verification, I have all nodes except for node256 variable-sized nodes, and the sizes are:
radix tree node 1 : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4 : 6 + 4 + (6) + 4*8 = 48
radix tree node 15 : 6 + 32 + (2) + 15*8 = 160
radix tree node 32 : 6 + 32 + (2) + 32*8 = 296
radix tree node 61 : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 + 256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 + 256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32 + 256*8 = 2088
I did some performance tests against two radix trees: a radix tree supporting only fixed-size nodes (i.e. applying up to 0003 patch), and a radix tree supporting variable-size nodes (i.e. applying all attached patches). Also, I changed the bench_search_random_nodes() function so that we can specify the filter via a function argument. Here are the results:
* Query
select * from bench_seq_search(0, 1*1000*1000, false)
* Fixed-size
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
 nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871216 |                     |         67 |               |          212 |
(1 row)
* Variable-size
NOTICE: num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32 = 31251, n61 = 0, n128 = 1, n256 = 122
 nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871280 |                     |         74 |               |          212 |
(1 row)
---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
* Fixed-size
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680848 |                     |         74 |               |          201 |
(1 row)
* Variable-size
NOTICE: num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951, n32 = 35548, n61 = 1, n128 = 0, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         16009040 |                     |         85 |               |          201 |
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')
* Fixed-size
NOTICE: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     343001456 |      1151
(1 row)
* Variable-size
NOTICE: num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 = 138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
 mem_allocated | search_ms
---------------+-----------
     230504328 |      1077
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE: num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 = 65536, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      99911920 |       632
(1 row)
* Variable-size
NOTICE: num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0, n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
 mem_allocated | search_ms
---------------+-----------
      64045688 |       554
(1 row)
Overall, the idea of variable-sized nodes is good, smaller size without losing search performance. I'm going to check the load performance as well.
I've attached the patches I used for the verification. I don't include patches for pointer tagging, DSA support, and vacuum integration since I'm investigating the issue on cfbot that Andres reported. Also, I've modified tests to improve the test coverage.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> So it seems that there are two candidates of rt_node structure: (1)
> all nodes except for node256 are variable-size nodes and use pointer
> tagging, and (2) node32 and node128 are variable-sized nodes and do
> not use pointer tagging (fanout is in part of only these two nodes).
> rt_node can be 5 bytes in both cases. But before going to this step, I
> started to verify the idea of variable-size nodes by using 6-bytes
> rt_node. We can adjust the node kinds and node classes later.
First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs below.)
Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as needed.
> Overall, the idea of variable-sized nodes is good, smaller size
> without losing search performance.
Good.
> I'm going to check the load
> performance as well.
Part of that is this, which gets called a lot more now, when node1 expands:
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can compile to a single instruction for the smallest node kind. (More on alloc/init below)
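A sketch of that suggestion, with assumed function and struct names (rt_init_node_kind and rt_node_inner_4 are illustrative, not the patch's actual API):
/* always-inline init whose memset size is a compile-time constant per call site */
static inline void
rt_init_node_kind(rt_node *node, uint8 kind, Size node_size)
{
	/* with a constant node_size, this memset becomes a few inline stores */
	memset(node, 0, node_size);
	node->kind = kind;			/* assumes the kind byte lives in the node header */
}
/* hypothetical call site: allocate without zeroing, then initialize */
newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[kind],
										 rt_node_kind_info[kind].inner_size);
rt_init_node_kind(newnode, RT_NODE_KIND_4, sizeof(rt_node_inner_4));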
Note, there is a wrinkle: As currently written inner_node128 searches the child pointers for NULL when inserting, so when expanding from partial to full size class, the new node must be zeroed. (Worth fixing in the short term; I thought of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than special-casing this, I actually want to rewrite the inner-node128 to be more similar to the leaf, with an "isset" array, but accessed and tested differently. I guarantee it's *really* slow now to load (maybe somewhat true even for leaves), but I'll leave the details for later. Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
node61: 6 + 256+(2) +16 + 61*8 = 768
node125: 6 + 256+(2) +16 + 125*8 = 1280
> I've attached the patches I used for the verification. I don't include
> patches for pointer tagging, DSA support, and vacuum integration since
> I'm investigating the issue on cfbot that Andres reported. Also, I've
> modified tests to improve the test coverage.
Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some additional comments:
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}
I don't see the point of a function that just calls two functions.
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;
And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set those, too -- it might even improve readability.
- if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?
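For concreteness, the simplification being suggested might look like this (assuming the lookup table is const as discussed, so the size folds to a compile-time constant):
/* sketch: no runtime count parameter; both memcpy sizes are constants */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children)
{
	const int	count = rt_size_class_info[RT_CLASS_4_FULL].fanout;
	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}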
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [v11]
There is one more thing that just now occurred to me: expanding the use of size classes makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future?
--
John Naylor
EDB: http://www.enterprisedb.com
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:
/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
Assert(!NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
return slotpos;
}
...passes an integer to this function, whose parameter is a uint8:
/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[slot] != NULL);
}
...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:
add eax, 1
movzx ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147
The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM
--
John Naylor
EDB: http://www.enterprisedb.com
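The shape of the fix described above is roughly the following sketch; the slot-index field name and the invalid-index constant are assumptions for illustration:
static inline void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
	int			slotpos = node->slot_idxs[chunk];	/* assumed field name */
	Assert(!NODE_IS_LEAF(node));
	node->slot_idxs[chunk] = INVALID_SLOT_IDX;		/* assumed constant */
	/* without this, stale children[] entries make the slot scan loop forever */
	node->children[slotpos] = NULL;
}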
> The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM
Oops. I meant to finish with "Since VACUUM doesn't perform deletion we didn't have an opportunity to detect this during that operation."
--
John Naylor
EDB: http://www.enterprisedb.com
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> - See how much performance we actually gain from tagging the node kind.
Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.
> - Try additional size classes while keeping the node kinds to only four.
This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.
> - Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes. To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later. This is not meant to be included in the next patchset. For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
select * from bench_node128_load(120);
v11
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208304 | 56
v11 + 0006 addendum
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208816 | 34
I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.
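For illustration, the word-at-a-time search for a free slot could look something like this, assuming the "isset" bitmap representation described above and using pg_rightmost_one_pos64() from pg_bitutils.h:
/* sketch: find the first unused slot by scanning whole bitmap words */
static inline int
node_128_find_unused_slot(const uint64 *isset, int nwords)
{
	for (int i = 0; i < nwords; i++)
	{
		uint64		inverse = ~isset[i];
		if (inverse != 0)
			return i * 64 + pg_rightmost_one_pos64(inverse);
	}
	return -1;					/* full; caller must grow to the next kind */
}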
> - Try templating out the differences between local and shared memory.
I hope to start this sometime after the crashes on 32-bit are resolved.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Fri, Nov 25, 2022 at 5:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > So it seems that there are two candidates of rt_node structure: (1) all nodes except for node256 are variable-size nodes and use pointer tagging, and (2) node32 and node128 are variable-sized nodes and do not use pointer tagging (fanout is in part of only these two nodes). rt_node can be 5 bytes in both cases. But before going to this step, I started to verify the idea of variable-size nodes by using 6-bytes rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs below.)
>
> Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as needed.
>
> > Overall, the idea of variable-sized nodes is good, smaller size without losing search performance.
>
> Good.
>
> > I'm going to check the load performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> + newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> + rt_node_kind_info[kind].inner_size);
> + else
> + newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> + rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can compile to a single instruction for the smallest node kind. (More on alloc/init below)
Right. I forgot to update it.
> Note, there is a wrinkle: As currently written inner_node128 searches the child pointers for NULL when inserting, so when expanding from partial to full size class, the new node must be zeroed (Worth fixing in the short term. I thought of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than special-casing this, I actually want to rewrite the inner-node128 to be more similar to the leaf, with an "isset" array, but accessed and tested differently. I guarantee it's *really* slow now to load (maybe somewhat true even for leaves), but I'll leave the details for later.
Agreed, I'll start with zeroing out the node when expanding from partial to full size.
> Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
>
> node61: 6 + 256+(2) +16 + 61*8 = 768
> node125: 6 + 256+(2) +16 + 125*8 = 1280
Agreed, changed.
> > I've attached the patches I used for the verification. I don't include patches for pointer tagging, DSA support, and vacuum integration since I'm investigating the issue on cfbot that Andres reported. Also, I've modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_node(tree, kind, inner);
> + rt_init_node(newnode, kind, shift, chunk, inner);
> +
> + return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.
Removed.
> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> + node->shift > 0);
> + newnode->count = node->count;
> +
> + return newnode;
> +}
>
> This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:
> + newnode->node_shift = oldnode->node_shift;
> + newnode->node_chunk = oldnode->node_chunk;
> + newnode->count = oldnode->count;
>
> And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set those, too -- it might even improve readability.
>
> - if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
> + if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
Agreed.
> This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:
>
> + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
> + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
>
> Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
>
> +/* Copy both chunks and children/values arrays */
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> + uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + /* For better code generation */
> + if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
> + pg_unreachable();
> +
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?
Right. We don't need to pass count to these functions.
> - .fanout = 256,
> + /* technically it's 256, but we can't store that in a uint8,
> + and this is the max size class so it will never grow */
> + .fanout = 0,
>
> - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
> + Assert(((rt_node *) n256)->fanout == 0);
> + Assert(chunk_exists || ((rt_node *) n256)->count < 256);
>
> These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
Since the node has a fanout field regardless of whether it is fixed- or variable-sized, only node256 is the special case, where the fanout stored in the node doesn't match its actual fanout. So if we want two versions of NODE_HAS_FREE_SLOT, I think we can have one for node256 and one for the other classes. Thoughts?
For the fixed-sized-node version of NODE_HAS_FREE_SLOT in your idea, did you mean something like the following?
#define FIXED_NODE_HAS_FREE_SLOT(node, class) (node->base.n.count < rt_size_class_info[class].fanout)
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 25, 2022 at 6:47 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > [v11]
>
> There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future?
Makes sense. We can add size classes once we have a good design and implementation around shared memory.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 1:36 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function: > > /* Return an unused slot in node-128 */ > static int > node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk) > { > int slotpos = 0; > > Assert(!NODE_IS_LEAF(node)); > while (node_inner_128_is_slot_used(node, slotpos)) > slotpos++; > > return slotpos; > } > > ...passes an integer to this function, whose parameter is a uint8: > > /* Is the slot in the node used? */ > static inline bool > node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot) > { > Assert(!NODE_IS_LEAF(node)); > return (node->children[slot] != NULL); > } > > ...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this: > > add eax, 1 > movzx ecx, al > cmp QWORD PTR [rbx+264+rcx*8], 0 > jne .L147 > > The fix is easy enough -- set the child pointer to null upon deletion,

Good catch!

> but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

Indeed, there are some tests for deletion but all of them delete all keys in the node so we end up deleting the node. I've added tests of repeating deletion and insertion as well as additional assertions.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
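As an aside for readers following along, here is a tiny standalone C illustration (not PostgreSQL code) of why the implicit truncation to uint8 turns the stale-pointer bug into an infinite loop rather than an out-of-bounds access or unnecessary growth:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for node_inner_128_is_slot_used() when every stale child
 * pointer is still non-NULL: it reports "used" for any slot 0..255. */
static int
is_slot_used(uint8_t slot)
{
	return 1;
}

int
main(void)
{
	int		slotpos = 0;

	/* slotpos keeps growing, but the callee only ever sees slotpos % 256,
	 * so the index never goes out of range and the loop never ends. */
	while (is_slot_used((uint8_t) slotpos))
	{
		slotpos++;
		if (slotpos > 10000)	/* bail out, just for this demo */
		{
			printf("loop would never terminate on its own\n");
			break;
		}
	}
	return 0;
}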
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote: > > Sure. I've attached the v10 patches. 0004 is the pure refactoring > > patch and 0005 patch introduces the pointer tagging. > > This failed on cfbot, with so many crashes that the VM ran out of disk for > core dumps. During testing with 32bit, so there's probably something broken > around that. > > https://cirrus-ci.com/task/4635135954386944 > > A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log > > performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment > 0x590faf74: note: pointer points here > 90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > ^

The radix_tree_control struct has two pg_atomic_uint64 variables, and the assertion check in pg_atomic_init_u64() failed:

static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
	/*
	 * Can't necessarily enforce alignment - and don't need it - when using
	 * the spinlock based fallback implementation. Therefore only assert when
	 * not using it.
	 */
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
	AssertPointerAlignment(ptr, 8);
#endif
	pg_atomic_init_u64_impl(ptr, val);
}

I've investigated this issue and have a question about using atomic variables on palloc'ed memory. In non-parallel vacuum cases, radix_tree_control is allocated via aset.c. IIUC in 32-bit machines, the memory allocated by aset.c is 4-bytes aligned so these atomic variables are not always 8-bytes aligned. Is there any way to enforce 8-bytes aligned memory allocations in 32-bit machines?

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
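One generic workaround, shown purely as an illustration (it is not what the patch set ends up doing -- the atomics are later removed instead), is to over-allocate and round the pointer up to an 8-byte boundary, which is essentially what an alignment macro like PostgreSQL's TYPEALIGN computes:

#include <stdint.h>
#include <stdlib.h>

/* Round ptr up to the next multiple of align (align must be a power of two). */
#define ALIGN_UP(align, ptr) \
	(((uintptr_t) (ptr) + ((align) - 1)) & ~((uintptr_t) ((align) - 1)))

/* Over-allocate by 8 bytes and hand back an 8-byte-aligned pointer.
 * The caller must keep *raw_out around in order to free the block later. */
static void *
alloc_aligned8(size_t size, void **raw_out)
{
	char	   *raw = malloc(size + 8);

	*raw_out = raw;
	return raw ? (void *) ALIGN_UP(8, raw) : NULL;
}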
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've investigated this issue and have a question about using atomic
> variables on palloc'ed memory. In non-parallel vacuum cases,
> radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
> the memory allocated by aset.c is 4-bytes aligned so these atomic
> variables are not always 8-bytes aligned. Is there any way to enforce
> 8-bytes aligned memory allocations in 32-bit machines?
The bigger question in my mind is: Why is there an atomic variable in backend-local memory?
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
>
> Since the node has fanout regardless of fixed-sized and
> variable-sized
As currently coded, yes. But that's not strictly necessary, I think.
>, only node256 is the special case where the fanout in
> the node doesn't match the actual fanout of the node. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class)
> (node->base.n.count < rt_size_class_info[class].fanout)
Right, and the other one could be VAR_NODE_...
--
John Naylor
EDB: http://www.enterprisedb.com
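A sketch of the two variants being discussed might look like the following; the field and table names come from the snippets quoted earlier, while the exact macro shapes are assumptions rather than the eventual patch:

/* Fixed-sized classes: the capacity is a property of the size class,
 * so compare against the (const) size class table. */
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
	((node)->base.n.count < rt_size_class_info[class].fanout)

/* Variable-sized classes: compare against the fanout recorded in the
 * node header when it was allocated. */
#define VAR_NODE_HAS_FREE_SLOT(node) \
	((node)->base.n.count < (node)->base.n.fanout)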
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've investigated this issue and have a question about using atomic > > variables on palloc'ed memory. In non-parallel vacuum cases, > > radix_tree_control is allocated via aset.c. IIUC in 32-bit machines, > > the memory allocated by aset.c is 4-bytes aligned so these atomic > > variables are not always 8-bytes aligned. Is there any way to enforce > > 8-bytes aligned memory allocations in 32-bit machines? > > The bigger question in my mind is: Why is there an atomic variable in backend-local memory? Because I use the same radix_tree and radix_tree_control structs for non-parallel and parallel vacuum. Therefore, radix_tree_control is allocated in DSM for parallel-vacuum cases or in backend-local memory for non-parallel vacuum cases. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > The bigger question in my mind is: Why is there an atomic variable in backend-local memory?
>
> Because I use the same radix_tree and radix_tree_control structs for
> non-parallel and parallel vacuum. Therefore, radix_tree_control is
> allocated in DSM for parallel-vacuum cases or in backend-local memory
> for non-parallel vacuum cases.
Ok, that could be yet another reason to compile local- and shared-memory functionality separately, but now I'm wondering why there are atomic variables at all, since there isn't yet any locking support.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 2:51 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update: > > On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > - See how much performance we actually gain from tagging the node kind. > > Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later. > > > - Try additional size classes while keeping the node kinds to only four. > > This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible. > > > - Optimize node128 insert. > > I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.

Thanks! I think this is a good idea.

> To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.

Agreed. Since tidbitmap.c also has WORDNUM(x) and BITNUM(x), we can use it if we move from bitmapset.h.

> This is not meant to be included in the next patchset. For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
>
> select * from bench_node128_load(120);
>
> v11
>
> NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
> 120 | 14400 | 208304 | 56
>
> v11 + 0006 addendum
>
> NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
> fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
> --------+-------+------------------+------------------
> 120 | 14400 | 208816 | 34
>
> I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.

After updating the patch set according to recent comments, I've also done the same test in my environment and got similar good results.

w/o 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204424 | 29
(1 row)

w/ 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204936 | 18
(1 row)

> > - Try templating out the differences between local and shared memory.
>
> I hope to start this sometime after the crashes on 32-bit are resolved.

I've attached updated patches that incorporated all comments I got so far as well as fixes for compiler warnings. I included your bitmapword patch as 0004 for benchmarking.
Also I reverted the change around pg_atomic_u64 since we don't support any locking as you mentioned and if we have a single lwlock to protect the radix tree, we don't need to use pg_atomic_u64 only for max_val and num_keys. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v12-0007-PoC-lazy-vacuum-integration.patch
- v12-0005-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
- v12-0003-tool-for-measuring-radix-tree-performance.patch
- v12-0006-PoC-DSA-support-for-radix-tree.patch
- v12-0004-Use-bitmapword-for-node-125.patch
- v12-0002-Add-radix-implementation.patch
- v12-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
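The word-at-a-time search described above can be illustrated with a small standalone sketch (this is not the attached patch; the function and array names are made up, and __builtin_ctzll stands in for the pg_bitutils.h helpers):

#include <stdint.h>

/* Return the index of the first zero bit in a bitmap of "nwords" 64-bit
 * words, or -1 if every bit is set.  Scanning a whole word per iteration
 * replaces the byte-by-byte loop over slots. */
static int
find_first_unused_slot(const uint64_t *isset, int nwords)
{
	for (int i = 0; i < nwords; i++)
	{
		uint64_t	free_bits = ~isset[i];

		if (free_bits != 0)
			return i * 64 + __builtin_ctzll(free_bits);
	}
	return -1;
}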
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
> > > - Optimize node128 insert.
> >
> > I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
>
> Thanks! I think this is a good idea.
>
> > To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.
I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs. shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.
[1] https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v13-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v13-0004-Use-bitmapword-for-node-125.patch
- v13-0003-Add-radix-implementation.patch
- v13-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v13-0005-tool-for-measuring-radix-tree-performance.patch
- v13-0008-PoC-lazy-vacuum-integration.patch
- v13-0007-PoC-DSA-support-for-radix-tree.patch
- v13-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
On Tue, Dec 6, 2022 at 7:32 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > > > > > - Optimize node128 insert. > > > > > > I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes. > > > > Thanks! I think this is a good idea. > > > > > To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later. > > I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs. shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.

Thank you so much!

In the meanwhile, I've been working on vacuum integration. There are two things I'd like to discuss some time:

The first is the minimum of maintenance_work_mem, 1 MB. Since the initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel vacuum with radix tree cannot work with the minimum maintenance_work_mem. It will need to increase it to 4MB or so. Maybe we can start a new thread for that.

The second is how to limit the size of the radix tree to maintenance_work_mem. I think that it's tricky to estimate the maximum number of keys in the radix tree that fit in maintenance_work_mem. The radix tree size varies depending on the key distribution. The next idea I considered was how to limit the size when inserting a key. In order to strictly limit the radix tree size, probably we have to change the rt_set so that it breaks off and returns false if the radix tree size is about to exceed the memory limit when we allocate a new node or grow a node kind/class. Ideally, I'd like to control the size outside of radix tree (e.g. TIDStore) since it could introduce overhead to rt_set() but probably we need to add such logic in radix tree.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In the meanwhile, I've been working on vacuum integration. There are
> two things I'd like to discuss some time:
>
> The first is the minimum of maintenance_work_mem, 1 MB. Since the
> initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> vacuum with radix tree cannot work with the minimum
> maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> we can start a new thread for that.
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> The second is how to limit the size of the radix tree to
> maintenance_work_mem. I think that it's tricky to estimate the maximum
> number of keys in the radix tree that fit in maintenance_work_mem. The
> radix tree size varies depending on the key distribution. The next
> idea I considered was how to limit the size when inserting a key. In
> order to strictly limit the radix tree size, probably we have to
> change the rt_set so that it breaks off and returns false if the radix
> tree size is about to exceed the memory limit when we allocate a new
> node or grow a node kind/class.
That seems complex, fragile, and wrong scope.
> Ideally, I'd like to control the size
> outside of radix tree (e.g. TIDStore) since it could introduce
> overhead to rt_set() but probably we need to add such logic in radix
> tree.
Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is? If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
--
John Naylor
EDB: http://www.enterprisedb.com
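A minimal sketch of the local-array idea above (the TID store API and all names here are hypothetical, not from the patch set): buffer TIDs locally, flush before the next heap page could overflow the buffer, and only check the store's memory usage at flush time:

/* Assumes PostgreSQL headers for ItemPointerData and MaxHeapTuplesPerPage. */
#define LOCAL_DEAD_TID_CAPACITY		8192	/* roughly 48kB of ItemPointerData */

typedef struct LocalDeadTids
{
	int				num;
	ItemPointerData	tids[LOCAL_DEAD_TID_CAPACITY];
} LocalDeadTids;

/* Returns false when the caller should pause and run index/heap vacuum. */
static bool
record_dead_tids_from_page(TidStore *store, LocalDeadTids *buf,
						   ItemPointerData *page_tids, int ntids,
						   size_t mem_limit)
{
	/* Flush first if the next page could overflow the local buffer. */
	if (buf->num + MaxHeapTuplesPerPage > LOCAL_DEAD_TID_CAPACITY)
	{
		tidstore_add_tids(store, buf->tids, buf->num);	/* hypothetical API */
		buf->num = 0;

		/* Only look at shared-store memory usage at flush time. */
		if (tidstore_memory_usage(store) > mem_limit)	/* hypothetical API */
			return false;
	}

	memcpy(&buf->tids[buf->num], page_tids, sizeof(ItemPointerData) * ntids);
	buf->num += ntids;
	return true;
}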
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > In the meanwhile, I've been working on vacuum integration. There are > > two things I'd like to discuss some time: > > > > The first is the minimum of maintenance_work_mem, 1 MB. Since the > > initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel > > vacuum with radix tree cannot work with the minimum > > maintenance_work_mem. It will need to increase it to 4MB or so. Maybe > > we can start a new thread for that. > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).

The minimum requirement is 2MB. In PoC patch, TIDStore checks how big the radix tree is using dsa_get_total_size(). If the size returned by dsa_get_total_size() (+ some memory used by TIDStore meta information) exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum and heap vacuum. However, when allocating DSA memory for radix_tree_control at creation, we allocate 1MB (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for radix_tree_control from it. dsa_get_total_size() returns 1MB even if there is no TID collected.

> > > The second is how to limit the size of the radix tree to > > maintenance_work_mem. I think that it's tricky to estimate the maximum > > number of keys in the radix tree that fit in maintenance_work_mem. The > > radix tree size varies depending on the key distribution. The next > > idea I considered was how to limit the size when inserting a key. In > > order to strictly limit the radix tree size, probably we have to > > change the rt_set so that it breaks off and returns false if the radix > > tree size is about to exceed the memory limit when we allocate a new > > node or grow a node kind/class. > > That seems complex, fragile, and wrong scope. > > > Ideally, I'd like to control the size > > outside of radix tree (e.g. TIDStore) since it could introduce > > overhead to rt_set() but probably we need to add such logic in radix > > tree. > > Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is?

Yes, TIDStore can check it using dsa_get_total_size().

> If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow. > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.

Right, I think it's no problem in slab cases. In DSA cases, the new segment size follows a geometric series that approximately doubles the total storage each time we create a new segment. This behavior comes from the fact that the underlying DSM system isn't designed for large numbers of segments.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
>
> The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
> the radix tree is using dsa_get_total_size(). If the size returned by
> dsa_get_total_size() (+ some memory used by TIDStore meta information)
> exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> and heap vacuum. However, when allocating DSA memory for
> radix_tree_control at creation, we allocate 1MB
> (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
> > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
>
> Right, I think it's no problem in slab cases. In DSA cases, the new
> segment size follows a geometric series that approximately doubles the
> total storage each time we create a new segment. This behavior comes
> from the fact that the underlying DSM system isn't designed for large
> numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.
--
John Naylor
EDB: http://www.enterprisedb.com
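To make the proposed check concrete, here is a minimal sketch under the assumptions above (dsa_get_total_size() is the function the PoC TIDStore already uses; the function and variable names are otherwise hypothetical):

/* Sketch: decide whether to pause the heap scan and run index vacuum.
 * "area" is the DSA area holding the dead-TID radix tree;
 * maintenance_work_mem is the GUC, expressed in kilobytes. */
static bool
dead_items_over_limit(dsa_area *area)
{
	size_t	limit_bytes = (size_t) maintenance_work_mem * 1024;

	/*
	 * Each new DSA segment can roughly double the total area size, so
	 * stopping once we pass 50% keeps the eventual total under the limit.
	 */
	return dsa_get_total_size(area) > limit_bytes / 2;
}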
On Mon, Dec 12, 2022 at 7:14 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail). > > > > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big > > the radix tree is using dsa_get_total_size(). If the size returned by > > dsa_get_total_size() (+ some memory used by TIDStore meta information) > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum > > and heap vacuum. However, when allocating DSA memory for > > radix_tree_control at creation, we allocate 1MB > > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for > > radix_tree_control from it. dsa_get_total_size() returns 1MB even if > > there is no TID collected. > > 2MB makes sense. > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below. > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. > > > > Right, I think it's no problem in slab cases. In DSA cases, the new > > segment size follows a geometric series that approximately doubles the > > total storage each time we create a new segment. This behavior comes > > from the fact that the underlying DSM system isn't designed for large > > numbers of segments. > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this: > > maintenance work mem = 256MB, so stop if we go over 128MB: > > 2*(1+2+4+8+16+32) = 126MB -> keep going > 126MB + 64 = 190MB -> stop > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).

Right. In this case, even if we allocate 64MB, we will use only 2088 bytes at maximum. So I think the memory space used for vacuum is practically limited to half.

> > And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.

Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it seems that they look at only memory that are actually dsa_allocate'd. To be exact, we estimate the number of hash buckets based on work_mem (and hash_mem_multiplier) and use it as the upper limit. So I've confirmed that the result of dsa_get_total_size() could exceed the limit. I'm not sure it's a known and legitimate usage. If we can follow such usage, we can probably track how much dsa_allocate'd memory is used in the radix tree. Templating whether or not to count the memory usage might help avoid the overheads.

> After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.

I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Dec 12, 2022 at 7:14 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > > > > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail). > > > > > > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big > > > the radix tree is using dsa_get_total_size(). If the size returned by > > > dsa_get_total_size() (+ some memory used by TIDStore meta information) > > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum > > > and heap vacuum. However, when allocating DSA memory for > > > radix_tree_control at creation, we allocate 1MB > > > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for > > > radix_tree_control from it. dsa_get_total_size() returns 1MB even if > > > there is no TID collected. > > > > 2MB makes sense. > > > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below. > > > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. > > > > > > Right, I think it's no problem in slab cases. In DSA cases, the new > > > segment size follows a geometric series that approximately doubles the > > > total storage each time we create a new segment. This behavior comes > > > from the fact that the underlying DSM system isn't designed for large > > > numbers of segments. > > > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this: > > > > maintenance work mem = 256MB, so stop if we go over 128MB: > > > > 2*(1+2+4+8+16+32) = 126MB -> keep going > > 126MB + 64 = 190MB -> stop > > > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine). > > Right. In this case, even if we allocate 64MB, we will use only 2088 > > bytes at maximum. So I think the memory space used for vacuum is > > practically limited to half. > > > > And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit. > > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it > > seems that they look at only memory that are actually dsa_allocate'd. > > To be exact, we estimate the number of hash buckets based on work_mem > > (and hash_mem_multiplier) and use it as the upper limit. So I've > > confirmed that the result of dsa_get_total_size() could exceed the > > limit. I'm not sure it's a known and legitimate usage. If we can > > follow such usage, we can probably track how much dsa_allocate'd > > memory is used in the radix tree.

I've experimented with this idea. The newly added 0008 patch changes the radix tree so that it counts the memory usage for both local and shared cases. As shown below, there is an overhead for that:

w/o 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277, n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
298453544 | 282
(1 row)

w/ 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277, n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
293603184 | 297
(1 row)

Although it adds some overhead, I think this idea is straightforward and the most practical for users. And it seems to be consistent with other components using DSA. We can improve this part in the future for better memory control, for example, by introducing slab-like DSA memory management.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v14-0005-tool-for-measuring-radix-tree-performance.patch
- v14-0008-PoC-calculate-memory-usage-in-radix-tree.patch
- v14-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
- v14-0009-PoC-lazy-vacuum-integration.patch
- v14-0007-PoC-DSA-support-for-radix-tree.patch
- v14-0004-Use-bitmapword-for-node-125.patch
- v14-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v14-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v14-0003-Add-radix-implementation.patch
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Dec 12, 2022 at 7:14 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > > > > > > > > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail). > > > > > > > > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big > > > > the radix tree is using dsa_get_total_size(). If the size returned by > > > > dsa_get_total_size() (+ some memory used by TIDStore meta information) > > > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum > > > > and heap vacuum. However, when allocating DSA memory for > > > > radix_tree_control at creation, we allocate 1MB > > > > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for > > > > radix_tree_control from it. dsa_get_total_size() returns 1MB even if > > > > there is no TID collected. > > > > > > 2MB makes sense. > > > > > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below. > > > > > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. > > > > > > > > Right, I think it's no problem in slab cases. In DSA cases, the new > > > > segment size follows a geometric series that approximately doubles the > > > > total storage each time we create a new segment. This behavior comes > > > > from the fact that the underlying DSM system isn't designed for large > > > > numbers of segments. > > > > > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this: > > > > > > maintenance work mem = 256MB, so stop if we go over 128MB: > > > > > > 2*(1+2+4+8+16+32) = 126MB -> keep going > > > 126MB + 64 = 190MB -> stop > > > > > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine). > > > > Right. In this case, even if we allocate 64MB, we will use only 2088 > > bytes at maximum. So I think the memory space used for vacuum is > > practically limited to half. > > > > And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit. > > > > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it > > seems that they look at only memory that are actually dsa_allocate'd. > > To be exact, we estimate the number of hash buckets based on work_mem > > (and hash_mem_multiplier) and use it as the upper limit. So I've > > confirmed that the result of dsa_get_total_size() could exceed the > > limit. I'm not sure it's a known and legitimate usage. If we can > > follow such usage, we can probably track how much dsa_allocate'd > > memory is used in the radix tree. > > I've experimented with this idea. The newly added 0008 patch changes > the radix tree so that it counts the memory usage for both local and > shared cases.

I've attached updated version patches to make cfbot happy.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v15-0008-PoC-calculate-memory-usage-in-radix-tree.patch
- v15-0005-tool-for-measuring-radix-tree-performance.patch
- v15-0007-PoC-DSA-support-for-radix-tree.patch
- v15-0009-PoC-lazy-vacuum-integration.patch
- v15-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
- v15-0004-Use-bitmapword-for-node-125.patch
- v15-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v15-0003-Add-radix-implementation.patch
- v15-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > seems that they look at only memory that are actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() could exceed the
> > limit. I'm not sure it's a known and legitimate usage. If we can
> > follow such usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases. As shown below, there is an overhead for that:
>
> w/o 0008 patch
> 298453544 | 282
> w/ 0008 patch
> 293603184 | 297
This adds about as much overhead as the improvement I measured in the v4 slab allocator patch. That's not acceptable, and is exactly what Andres warned about in
https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 20, 2022 at 3:09 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it > > > seems that they look at only memory that are actually dsa_allocate'd. > > > To be exact, we estimate the number of hash buckets based on work_mem > > > (and hash_mem_multiplier) and use it as the upper limit. So I've > > > confirmed that the result of dsa_get_total_size() could exceed the > > > limit. I'm not sure it's a known and legitimate usage. If we can > > > follow such usage, we can probably track how much dsa_allocate'd > > > memory is used in the radix tree. > > > > I've experimented with this idea. The newly added 0008 patch changes > > the radix tree so that it counts the memory usage for both local and > > shared cases. As shown below, there is an overhead for that: > > > > w/o 0008 patch > > 298453544 | 282 > > > w/ 0008 patch > > 293603184 | 297 > > This adds about as much overhead as the improvement I measured in the v4 slab allocator patch.

Oh, yes, that's bad.

> https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de > > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.

You mean that the memory used by the radix tree should be limited not by the amount of memory actually used, but by the amount of memory allocated? In other words, it checks by MemoryContextMemAllocated() in the local cases and by dsa_get_total_size() in the shared case.

The idea of using up to half of maintenance_work_mem might be a good idea compared to the current flat-array solution. But since it only uses half, I'm concerned that there will be users who double their maintenance_work_mem. When it is improved, the user needs to restore maintenance_work_mem again.

A better solution would be to have slab-like DSA. We allocate the dynamic shared memory by adding fixed-length large segments. However, the downside would be that since the segment size gets large, we need to increase maintenance_work_mem as well. Also, this patch set is already getting bigger and more complicated, I don't think it's a good idea to add more.

If we limit the memory usage by checking the amount of memory actually used, we can use SlabStats() for the local cases. Since DSA doesn't have such functionality for now we would need to add it. Or we can track it in the radix tree only in the shared cases.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 20, 2022 at 3:09 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
> >
> > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
>
> You mean that the memory used by the radix tree should be limited not
> by the amount of memory actually used, but by the amount of memory
> allocated? In other words, it checks by MemoryContextMemAllocated() in
> the local cases and by dsa_get_total_size() in the shared case.
I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.
> The idea of using up to half of maintenance_work_mem might be a good
> idea compared to the current flat-array solution. But since it only
> uses half, I'm concerned that there will be users who double their
> maintenance_work_mem. When it is improved, the user needs to restore
> maintenance_work_mem again.
I find it useful to step back and look at the usage patterns:
Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here.
Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.
So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap
That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example:
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop
I'm not sure if that calculation could cause going over the limit, or how common that would be.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Dec 22, 2022 at 7:24 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Dec 20, 2022 at 3:09 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de > > > > > > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint. > > > > You mean that the memory used by the radix tree should be limited not > > by the amount of memory actually used, but by the amount of memory > > allocated? In other words, it checks by MemoryContextMemAllocated() in > > the local cases and by dsa_get_total_size() in the shared case. > > I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.

Understood.

> > > The idea of using up to half of maintenance_work_mem might be a good > > idea compared to the current flat-array solution. But since it only > > uses half, I'm concerned that there will be users who double their > > maintenance_work_mem. When it is improved, the user needs to restore > > maintenance_work_mem again. > > I find it useful to step back and look at the usage patterns: > > Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here. > > Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.

Agreed.

> So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both > - much more efficient with memory on average > - free from the 1GB cap

Make sense.

> > That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example: > > 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going > 766 + 256 = 1022MB -> stop > > I'm not sure if that calculation could cause going over the limit, or how common that would be.
>
If the value is a power of 2, it seems to work perfectly fine. But for example if it's 700MB, the total memory exceeds the limit:

2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop but it exceeds the limit.

In a more bigger case, if it's 11000MB,

2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB

That being said, I don't think they are common cases. So the 75% threshold seems to work fine in most cases.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> If the value is a power of 2, it seems to work perfectly fine. But for
> example if it's 700MB, the total memory exceeds the limit:
>
> 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
> 510 + 256 = 766MB -> stop but it exceeds the limit.
>
> In a more bigger case, if it's 11000MB,
>
> 2*(1+2+...+2048) = 8190MB (74.4%)
> 8190 + 4096 = 12286MB
>
> That being said, I don't think they are common cases. So the 75%
> threshold seems to work fine in most cases.
Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent.
Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise. I'm skeptical of trying to be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size.
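A rough sketch of that heuristic, purely for illustration (names are hypothetical and this is not from the patches):

/* Sketch: pick a stop threshold as a fraction of the limit.  Use 75%
 * when the limit (in kB) is a power of two, otherwise stay at 50%. */
static size_t
dead_items_stop_threshold(int limit_kb)
{
	bool	power_of_two = (limit_kb & (limit_kb - 1)) == 0;
	size_t	limit_bytes = (size_t) limit_kb * 1024;

	return power_of_two ? limit_bytes - limit_bytes / 4 : limit_bytes / 2;
}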
I wrote:
> - Try templating out the differences between local and shared memory.
Here is a brief progress report before Christmas vacation.
I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicated code for v16.
0001-0005 are copies from v13.
0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includes replacing the goto with an extra "unlikely" branch.
0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the default and not just return NULL instantly.
0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class within the same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assert build verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible.
0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iteration needed, but it's good for simplicity and consistency.
0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
There is more that could be done here, but I didn't want to get too ahead of myself. For example, it's possible that struct members "children" and "values" are names that don't need to be distinguished. Making them the same would reduce code like
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
...but there could be downsides and I don't want to distract from the goal of dealing with shared memory.
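To make the idea concrete, here is a hypothetical sketch; the type and member names (rt_node_base, rt_node32, rt_slot_type) are stand-ins, not the patch's actual symbols. The point is that giving inner and leaf nodes one member name for their payload removes the #ifdef at the assignment site.

    #include <stdint.h>

    /* Stand-in for the per-node header; fields are illustrative only. */
    typedef struct rt_node_base
    {
        uint8_t     count;
        uint8_t     shift;
        uint8_t     kind;
    } rt_node_base;

    #ifdef RT_NODE_LEVEL_LEAF
    typedef uint64_t rt_slot_type;                  /* leaf nodes store values */
    #else
    typedef struct rt_node_base *rt_slot_type;      /* inner nodes store child pointers */
    #endif

    typedef struct rt_node32
    {
        rt_node_base base;
        uint8_t     chunks[32];
        rt_slot_type slots[32];     /* one name for both children and values */
    } rt_node32;

    /* The insert path then needs no per-level #ifdef at the assignment: */
    static inline void
    rt_node32_set_slot(rt_node32 *n32, int insertpos, rt_slot_type new_slot)
    {
        n32->slots[insertpos] = new_slot;
    }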
The tests pass, but it's not impossible that there is a new bug somewhere.
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v16-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v16-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v16-0003-Add-radix-implementation.patch
- v16-0004-Use-bitmapword-for-node-125.patch
- v16-0005-tool-for-measuring-radix-tree-performance.patch
- v16-0009-Use-bitmap-operations-for-isset-arrays-rather-th.patch
- v16-0008-Use-newnode-variable-to-reduce-unnecessary-casti.patch
- v16-0010-Template-out-node-insert-functions.patch
- v16-0007-Remove-STRICT-from-bench_search_random_nodes.patch
- v16-0011-Template-out-node-search-functions.patch
- v16-0012-Separate-find-and-delete-actions-into-separate-f.patch
- v16-0006-Preparatory-refactoring-to-simplify-templating.patch
On Fri, Dec 23, 2022 at 8:47 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > I wrote: > > > - Try templating out the differences between local and shared memory. > > Here is a brief progress report before Christmas vacation. Thanks! > > I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicatedcode for v16. > > 0001-0005 are copies from v13. > > 0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includesreplacing the goto with an extra "unlikely" branch. > > 0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the defaultand not just return NULL instantly. > > 0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class withinthe same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assertbuild verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible. > > 0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iterationneeded, but it's good for simplicity and consistency. These 4 patches make sense to me. We can merge them into 0002 patch and I'll do similar changes for functions for leaf nodes as well. > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting. > > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and mightgive a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but theyshould be straightforward. Cool! > > There is more that could be done here, but I didn't want to get too ahead of myself. For example, it's possible that structmembers "children" and "values" are names that don't need to be distinguished. Making them the same would reduce codelike > > +#ifdef RT_NODE_LEVEL_LEAF > + n32->values[insertpos] = value; > +#else > + n32->children[insertpos] = child; > +#endif > > ...but there could be downsides and I don't want to distract from the goal of dealing with shared memory. With these patches, some functions in radixtree.h load the header files, radixtree_xxx_impl.h, that have the function body. What do you think about how we can expand this template method to deal with DSA memory? I imagined that we load say radixtree_template.h with some macros to use the radix tree like we do for simplehash.h. And radixtree_template.h further loads xxx_impl.h files for some internal functions. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 23, 2022 at 8:47 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> These 4 patches make sense to me. We can merge them into 0002 patch
Okay, then I'll squash them when I post my next patch.
> and I'll do similar changes for functions for leaf nodes as well.
I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.
In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.
Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.
> > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
> >
> > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
Two things came to mind since I posted this, which I'll make clear next patch:
- A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.
> What do you
> think about how we can expand this template method to deal with DSA
> memory? I imagined that we load say radixtree_template.h with some
> macros to use the radix tree like we do for simplehash.h. And
> radixtree_template.h further loads xxx_impl.h files for some internal
> functions.
Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.
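Something along these lines, as a sketch of the caller-side pattern: the parameter names below are modeled on simplehash.h's SH_* conventions, and the file name radixtree_template.h is taken from the paragraph above, but none of this is a settled API.

    /* Hypothetical caller-side usage; none of these macros exist yet. */
    #define RT_PREFIX shared_rt             /* emitted symbols become shared_rt_* */
    #define RT_SCOPE static inline
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE uint64            /* value stored per key */
    #define RT_SHMEM                        /* build the DSA-backed variant */
    #include "lib/radixtree_template.h"     /* would itself include radixtree_xxx_impl.h */

    /*
     * A second inclusion with a different RT_PREFIX (and without RT_SHMEM)
     * could emit a local-memory variant in the same translation unit, much
     * as simplehash.h is included more than once today.
     */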
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 27, 2022 at 2:24 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 23, 2022 at 8:47 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > These 4 patches make sense to me. We can merge them into 0002 patch
>
> Okay, then I'll squash them when I post my next patch.
>
> > and I'll do similar changes for functions for leaf nodes as well.
>
> I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.

Right. If we template these routines I don't need that.

> In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.
>
> Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.

Thanks!

> > > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
> > >
> > > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
>
> Two things came to mind since I posted this, which I'll make clear next patch:
> - A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
> - Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.

Okay.

> > What do you
> > think about how we can expand this template method to deal with DSA
> > memory? I imagined that we load say radixtree_template.h with some
> > macros to use the radix tree like we do for simplehash.h. And
> > radixtree_template.h further loads xxx_impl.h files for some internal
> > functions.
>
> Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.

Thank you for your confirmation!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> [working on templating]
Part of what I didn't like about v8 was distinctions like "node" vs "nodep", which hinder readability. I've used "allocnode" for some cases where it makes sense, which is translated to "newnode" for the local pointer. Some places I just gave up and used "nodep" for parameters like in v8, just to get it done. We can revisit naming later.
Not done yet:
- get_handle() is not implemented
- rt_attach is defined but unused
- grow_node_kind() was hackishly removed, but could be turned into a macro (or function that writes to 2 pointers)
- node_update_inner() is back, now that we can share a template with "search". Seems easier to read, and I suspect this is easier for the compiler.
- the value type should really be a template macro, but is still hard-coded to uint64
- I think it's okay if the key is hard coded for PG16: If some use case needs more than uint64, we could consider "single-value leaves" with varlen keys as a template option.
- benchmark tests not updated
v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both locally and shared. I quickly hacked together changing shared/local tests by hand (need to recompile), but it would be good for maintainability if tests could run once each with local and shmem, but use the same "expected" test output.
Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.
At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate functions for each local and shared memory.
If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.
One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there; we should try to preserve it.
Attachment
- v17-0005-Template-out-inner-and-leaf-nodes.patch
- v17-0009-Implement-shared-memory.patch
- v17-0008-Invent-specific-pointer-macros.patch
- v17-0007-Convert-radixtree.h-into-a-template.patch
- v17-0006-Convert-radixtree.c-into-a-header.patch
- v17-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v17-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v17-0004-tool-for-measuring-radix-tree-performance.patch
- v17-0003-Add-radix-implementation.patch
On Mon, Jan 9, 2023 at 5:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> > [working on templating]
>
> In the end, I decided to base my effort on v8, and not v12 (based on one of my less-well-thought-out ideas). The latter was a good experiment, but it did not lead to an increase in readability as I had hoped. The attached v17 is still rough, but it's in good enough shape to evaluate a mostly-complete templating implementation.

I really appreciate your work!

> v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both locally and shared. I quickly hacked together changing shared/local tests by hand (need to recompile), but it would be good for maintainability if tests could run once each with local and shmem, but use the same "expected" test output.

Agreed.

> Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.
>
> At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate functions for each local and shared memory.
>
> If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.

It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have a union for the
radix tree and the structure would be like follows:

/* Per-backend state for a TidStore */
struct TidStore
{
    /*
     * Control object. This is allocated in DSA area 'area' in the shared
     * case, otherwise in backend-local memory.
     */
    TidStoreControl *control;

    /* Storage for Tids */
    union tree
    {
        local_radix_tree *local;
        shared_radix_tree *shared;
    };

    /* DSA area for TidStore if used */
    dsa_area *area;
};

In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need an if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O intensive operation in many cases. Overall, I think
there is no problem and I'll investigate it in depth.

Apart from that, I've been considering the lock support for the shared
radix tree. As we discussed before, the current usage (i.e., only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity. If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support. For example, if we want to support
Optimistic Lock Coupling[1], we would need to change not only the node
structure but also the logic, which probably widens the gap between
the code for the non-shared and shared radix tree. In this case, once
we have a better radix tree optimized for the shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this line.
> > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.

Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from the old to the
new (grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()), but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.

> Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there, we should try to preserve it.

Agreed.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> It looks no problem in terms of vacuum integration, although I've not
> fully tested yet. TID store uses the radix tree as the main storage,
> and with the template radix tree, the data types for shared and
> non-shared will be different. TID store can have an union for the
> radix tree and the structure would be like follows:
> /* Storage for Tids */
> union tree
> {
> local_radix_tree *local;
> shared_radix_tree *shared;
> };
We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
> Overall, I think
> there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.
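As a rough sketch of that branching approach in the TID store (the names TidStoreIsShared, tidstore_set_value, and the tree union members follow the spirit of the WIP patches, but are assumptions, not their exact API):

    /* One predictable branch per insertion; negligible next to vacuum's I/O. */
    static bool
    tidstore_set_value(TidStore *ts, uint64 key, uint64 val)
    {
        if (TidStoreIsShared(ts))
            return shared_rt_set(ts->tree.shared, key, val);
        else
            return local_rt_set(ts->tree.local, key, val);
    }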
> Apart from that, I've been considering the lock support for shared
> radix tree. As we discussed before, the current usage (i.e, only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
> For example, if we want to support
> Optimistic Lock Coupling[1],
Interesting, from the same authors!
> we would need to change not only the node
> structure but also the logic. Which probably leads to widen the gap
> between the code for non-shared and shared radix tree. In this case,
> once we have a better radix tree optimized for shared case, perhaps we
> can replace the templated shared radix tree with it. I'd like to hear
> your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.
> > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.
I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
On further reflection, this is completely false and I'm not sure what I was thinking. However, for the update-inner case maybe we can assert that we found a valid slot.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Jan 11, 2023 at 12:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have a union for the
> > radix tree and the structure would be like follows:
>
> > /* Storage for Tids */
> > union tree
> > {
> >     local_radix_tree *local;
> >     shared_radix_tree *shared;
> > };
>
> We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.

One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the flag is_shared in radix_tree. For
instance we have something like (based on the non-template version):

struct radix_tree
{
    bool is_shared;
    MemoryContext context;
};

typedef struct rt_shared
{
    rt_handle handle;
    uint32 magic;

    /* Root node */
    dsa_pointer root;

    uint64 max_val;
    uint64 num_keys;

    /* need a lwlock */

    /* statistics */
#ifdef RT_DEBUG
    int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

struct radix_tree_shared
{
    radix_tree rt;

    rt_shared *shared;
    dsa_area *area;
} radix_tree_shared;

struct radix_tree_local
{
    radix_tree rt;

    uint64 max_val;
    uint64 num_keys;

    rt_node *root;

    /* used only when the radix tree is private */
    MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
    MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

    /* statistics */
#ifdef RT_DEBUG
    int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;

> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.

I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.

> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e., only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors!

+1

> > we would need to change not only the node
> > structure but also the logic. Which probably leads to widen the gap
> > between the code for non-shared and shared radix tree. In this case,
> > once we have a better radix tree optimized for shared case, perhaps we
> > can replace the templated shared radix tree with it. I'd like to hear
> > your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.
>
> > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
> >
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from old to new
> > (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists on the parent we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()) but having a dedicated function to update the
> > existing chunk and child pointer might improve the performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
>
> I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.

Okay, let's get rid of that in the v18.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 11, 2023 at 12:13 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I agree to keep this as a template.
Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
> From the vacuum integration
> perspective, it would be better if we can use a common data type for
> shared and local. It makes sense to have different data types if the
> radix trees have different values types.
I agree it would be better, all else being equal. I have some further thoughts below.
> > > It looks no problem in terms of vacuum integration, although I've not
> > > fully tested yet. TID store uses the radix tree as the main storage,
> > > and with the template radix tree, the data types for shared and
> > > non-shared will be different. TID store can have an union for the
> > > radix tree and the structure would be like follows:
> >
> > > /* Storage for Tids */
> > > union tree
> > > {
> > > local_radix_tree *local;
> > > shared_radix_tree *shared;
> > > };
> >
> > We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
>
> One idea to have a common data type without unused fields is to use
> radix_tree a base class. We cast it to radix_tree_shared or
> radix_tree_local depending on the flag is_shared in radix_tree. For
> instance we have like (based on non-template version),
> struct radix_tree
> {
> bool is_shared;
> MemoryContext context;
> };
That could work in principle. My first impression is, just a memory context is not much of a base class. Also, casts can creep into a large number of places.
Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.
The template might be easier for future use cases if shared memory were all-or-nothing, meaning either
- completely different functions and types depending on RT_SHMEM
- use branches (like v8)
The union sounds like a good thing to try, but do whatever seems right.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 12, 2023 at 5:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 11, 2023 at 12:13 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I agree to keep this as a template.
>
> Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.

Thanks!

> > From the vacuum integration
> > perspective, it would be better if we can use a common data type for
> > shared and local. It makes sense to have different data types if the
> > radix trees have different value types.
>
> I agree it would be better, all else being equal. I have some further thoughts below.
>
> > > > It looks no problem in terms of vacuum integration, although I've not
> > > > fully tested yet. TID store uses the radix tree as the main storage,
> > > > and with the template radix tree, the data types for shared and
> > > > non-shared will be different. TID store can have a union for the
> > > > radix tree and the structure would be like follows:
> > >
> > > > /* Storage for Tids */
> > > > union tree
> > > > {
> > > >     local_radix_tree *local;
> > > >     shared_radix_tree *shared;
> > > > };
> > >
> > > We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
> >
> > One idea to have a common data type without unused fields is to use
> > radix_tree as a base class. We cast it to radix_tree_shared or
> > radix_tree_local depending on the flag is_shared in radix_tree. For
> > instance we have something like (based on the non-template version):
>
> > struct radix_tree
> > {
> >     bool is_shared;
> >     MemoryContext context;
> > };
>
> That could work in principle. My first impression is, just a memory context is not much of a base class. Also, casts can creep into a large number of places.
>
> Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.

True.

> The template might be easier for future use cases if shared memory were all-or-nothing, meaning either
>
> - completely different functions and types depending on RT_SHMEM
> - use branches (like v8)
>
> The union sounds like a good thing to try, but do whatever seems right.

I've implemented the idea of using a union. Let me share WIP code for
discussion; I've attached three patches that can be applied on top of
the v17-0009 patch. v17-0010 implements missing shared memory support
functions such as RT_DETACH and RT_GET_HANDLE, and some fixes.
v17-0011 adds TidStore, and v17-0012 is the vacuum integration.

Overall, the TidStore implementation with the union idea doesn't look
so ugly to me. But I got many compiler warnings about unused radix
tree functions like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used [-Wunused-function]

I'm not sure there is a convenient way to suppress this warning, but
one idea is to have some macros to specify what operations are
enabled/declared.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Dec 23, 2022 at 4:33 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > If the value is a power of 2, it seems to work perfectly fine. But for
> > example if it's 700MB, the total memory exceeds the limit:
> >
> > 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
> > 510 + 256 = 766MB -> stop but it exceeds the limit.
> >
> > In a bigger case, if it's 11000MB,
> >
> > 2*(1+2+...+2048) = 8190MB (74.4%)
> > 8190 + 4096 = 12286MB
> >
> > That being said, I don't think these are common cases. So the 75%
> > threshold seems to work fine in most cases.
>
> Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent.

Agreed.

> Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise. I'm skeptical of trying to be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size.

Sounds good to me. I've written a simple script to simulate the DSA
memory usage and the limit. The 75% limit works fine for the power of
two cases, and we can use a 60% limit for other cases (it seems we can
use up to about 66%, but I used 60% for safety). It would be best if we
could mathematically prove it, but I could prove only the power of two
cases. The script practically shows that the 60% threshold would work
for these cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.
0003 contains all v17 local-memory coding squashed together.
0004 perf test not updated but it doesn't build by default so it's fine for now
0005 removes node.chunk as discussed, but does not change node4 fanout yet.
0006 is a small cleanup regarding setting node fanout.
0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.
0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.
0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).
> I've implemented the idea of using union. Let me share WIP code for
> discussion, I've attached three patches that can be applied on top of
Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.
> Overall, TidStore implementation with the union idea doesn't look so
> ugly to me. But I got many compiler warning about unused radix tree
> functions like:
>
> tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> [-Wunused-function]
>
> I'm not sure there is a convenient way to suppress this warning but
> one idea is to have some macros to specify what operations are
> enabled/declared.
That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting. It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
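For example, a hypothetical opt-in scheme could look like the following; the RT_USE_DELETE name, the header file name, and the surrounding structure are illustrative only, not something in the posted patches:

    /* Caller that never deletes simply leaves RT_USE_DELETE undefined ... */
    #define RT_PREFIX shared_rt
    #define RT_SHMEM
    /* #define RT_USE_DELETE */
    #include "lib/radixtree_template.h"

    /* ... and inside the template, the definition is gated: */
    #ifdef RT_USE_DELETE
    RT_SCOPE bool
    RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
    {
        /* delete implementation */
    }
    #endif                          /* RT_USE_DELETE */

With that, shared_rt_delete would never be emitted for tidstore.c and the -Wunused-function warning goes away without needing pg_attribute_unused() markers.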
Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):
0011
+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */
Maybe the #define and comment should be close to here.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's used.
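For what it's worth, here is a generic sketch of the kind of key derivation that makes such a sentinel necessary; the shift width and the bits-per-value figure are assumptions for illustration, not necessarily what the WIP TidStore uses. Because consecutive offsets of one block map to the same key, an insert loop wants to accumulate bits and only store the value when the key changes, and "no previous key yet" needs a value that can never be a real key, hence something like PG_UINT64_MAX.

    #include <stdint.h>

    #define SKETCH_BITS_PER_VALUE 64    /* offsets packed into a uint64 bitmap value */

    /* Encode a TID as an integer, then split it into a tree key and a bit. */
    static inline uint64_t
    sketch_key_from_tid(uint32_t blkno, uint16_t offnum)
    {
        uint64_t    tid_int = ((uint64_t) blkno << 16) | offnum;

        return tid_int / SKETCH_BITS_PER_VALUE;
    }

    static inline int
    sketch_bit_from_tid(uint32_t blkno, uint16_t offnum)
    {
        uint64_t    tid_int = ((uint64_t) blkno << 16) | offnum;

        return (int) (tid_int % SKETCH_BITS_PER_VALUE);
    }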
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
This part only runs "if (vacrel->nindexes == 0)", so it seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've written a simple script to simulate the DSA memory usage and the
> limit. The 75% limit works fine for a power of two cases, and we can
> use the 60% limit for other cases (it seems we can use up to about 66%
> but used 60% for safety). It would be best if we can mathematically
> prove it but I could prove only the power of two cases. But the script
> practically shows the 60% threshold would work for these cases.
Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.
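A hedged sketch of what that check could look like from the TID store's side; the wrapper name and struct fields are assumptions, but MemoryContextMemAllocated() and dsa_get_total_size() are the existing functions mentioned earlier in the thread:

    /*
     * Stop before the limit rather than at it: the threshold leaves headroom
     * for the next DSA segment, which is assumed to at most double in size.
     */
    static bool
    tidstore_is_full(TidStore *ts)
    {
        Size        allocated;

        if (TidStoreIsShared(ts))
            allocated = dsa_get_total_size(ts->area);
        else
            allocated = MemoryContextMemAllocated(ts->context, true);

        /* e.g. 75% when max_bytes is a power of two, lower otherwise */
        return allocated >= ts->max_bytes * 3 / 4;
    }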
Attachment
- v18-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v18-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v18-0005-Remove-chunk-from-the-common-node-type.patch
- v18-0004-tool-for-measuring-radix-tree-performance.patch
- v18-0003-Add-radixtree-template.patch
- v18-0006-Clarify-coding-around-fanout.patch
- v18-0009-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v18-0008-Turn-branch-into-Assert-in-RT_NODE_UPDATE_INNER.patch
- v18-0007-Implement-shared-memory.patch
- v18-0010-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
On Mon, Jan 16, 2023 at 2:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 5:21 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
>
> There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.

Thanks! cfbot complains about some warnings but these are expected (due to unused delete routines etc). But one reported error[1] might be relevant to the 0002 patch?

[05:44:11.759] "link" /MACHINE:x64 /OUT:src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.res src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "/nologo" "/release" "/nologo" "/DEBUG" "/PDB:src/test\modules\test_radixtree\test_radixtree.pdb" "/DLL" "/IMPLIB:src/test\modules\test_radixtree\test_radixtree.lib" "/INCREMENTAL:NO" "/STACK:4194304" "/NOEXP" "/DEBUG:FASTLINK" "/NOIMPLIB" "C:/cirrus/build/src/backend/postgres.exe.lib" "wldap32.lib" "c:/openssl/1.1/lib/libssl.lib" "c:/openssl/1.1/lib/libcrypto.lib" "ws2_32.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "comdlg32.lib" "advapi32.lib"
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll : fatal error LNK1120: 1 unresolved externals

> 0003 contains all v17 local-memory coding squashed together.

+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.

This comment seems to be out-of-date since we made it a template.

---
+#ifndef RT_COMMON
+#define RT_COMMON

What are we using this macro RT_COMMON for?

---
The following macros are defined but not undefined in radixtree.h:

RT_MAKE_PREFIX RT_MAKE_NAME RT_MAKE_NAME_ RT_SEARCH UINT64_FORMAT_HEX
RT_NODE_SPAN RT_NODE_MAX_SLOTS RT_CHUNK_MASK RT_MAX_SHIFT RT_MAX_LEVEL
RT_NODE_125_INVALID_IDX RT_GET_KEY_CHUNK BM_IDX BM_BIT
RT_NODE_KIND_4 RT_NODE_KIND_32 RT_NODE_KIND_125 RT_NODE_KIND_256 RT_NODE_KIND_COUNT
RT_PTR_LOCAL RT_PTR_ALLOC RT_INVALID_PTR_ALLOC NODE_SLAB_BLOCK_SIZE

> 0004 perf test not updated but it doesn't build by default so it's fine for now

Okay.

> 0005 removes node.chunk as discussed, but does not change node4 fanout yet.

LGTM.

> 0006 is a small cleanup regarding setting node fanout.

LGTM.

> 0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.

+ /* XXX: do we need to set a callback on exit to detach dsa? */

In the current shared radix tree design, it is the caller's responsibility to create (or attach to) a DSA area and pass it to RT_CREATE() or RT_ATTACH(). It enables us to use one DSA area not only for the radix tree but also for other data, which is more flexible. So the caller needs to detach from the DSA somehow, so I think we don't need to set a callback here for that.

---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);

Similar to above, I think we should not detach from the DSA area here.

Given that the DSA area used by the radix tree could be used also by other data, I think that in RT_FREE() we need to free each radix tree node allocated in DSA. In lazy vacuum, we check the memory usage instead of the number of TIDs and need to reset the TidStore after an index scan. So it does RT_FREE() and dsa_trim() to return DSM segments to the OS. I've implemented rt_free_recurse() for this purpose in the v15 version patch.

--
- Assert(tree->root);
+ //Assert(tree->ctl->root);

I think we don't need this assertion in the first place. We check it at the beginning of the function.

---
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+

I think we can move this change to the 0003 patch.

> 0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.

LGTM.

> 0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).
>
> > I've implemented the idea of using union. Let me share WIP code for
> > discussion, I've attached three patches that can be applied on top of
>
> Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.

+1

> > Overall, TidStore implementation with the union idea doesn't look so
> > ugly to me. But I got many compiler warning about unused radix tree
> > functions like:
> >
> > tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
> > [-Wunused-function]
> >
> > I'm not sure there is a convenient way to suppress this warning but
> > one idea is to have some macros to specify what operations are
> > enabled/declared.
>
> That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.

Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.

> It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)

Agreed.

> Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):
>
> 0011
>
> + * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
> + * bytes a TidStore can use. These two fields are commonly used in both
> + * non-shared case and shared case.
> + */
> + uint32 num_tids;
>
> uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.

Agreed, will fix.

> + * We calculate the maximum bytes for the TidStore in different ways
> + * for non-shared case and shared case. Please refer to the comment
> + * TIDSTORE_MEMORY_DEDUCT for details.
> + */
>
> Maybe the #define and comment should be close to here.

Will fix.

> + * Destroy a TidStore, returning all memory. The caller must be certain that
> + * no other backend will attempt to access the TidStore before calling this
> + * function. Other backend must explicitly call tidstore_detach to free up
> + * backend-local memory associated with the TidStore. The backend that calls
> + * tidstore_destroy must not call tidstore_detach.
> + */
> +void
> +tidstore_destroy(TidStore *ts)
>
> If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.

Will fix.

> + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> + * in 'offsets' are ordered in ascending order.
>
> Must? What happens otherwise?

It ends up missing TIDs by overwriting the same key with different values. Is it better to have a bool argument, say need_sort, to sort the given array if the caller wants?

> + uint64 last_key = PG_UINT64_MAX;
>
> I'm having some difficulty understanding this sentinel and how it's used.

Will improve the logic.

> @@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
>  if (prunestate.has_lpdead_items)
>  {
>  Size freespace;
> + TidStoreIter *iter;
> + TidStoreIterResult *result;
>
> - lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
> + iter = tidstore_begin_iterate(vacrel->dead_items);
> + result = tidstore_iterate_next(iter);
> + lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
> + buf, &vmbuffer);
> + Assert(!tidstore_iterate_next(iter));
> + tidstore_end_iterate(iter);
>
>  /* Forget the LP_DEAD items that we just vacuumed */
> - dead_items->num_items = 0;
> + tidstore_reset(dead_items);
>
> This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.

I agree that we don't need complexity here. I'll try this idea.

> On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've written a simple script to simulate the DSA memory usage and the
> > limit. The 75% limit works fine for a power of two cases, and we can
> > use the 60% limit for other cases (it seems we can use up to about 66%
> > but used 60% for safety). It would be best if we can mathematically
> > prove it but I could prove only the power of two cases. But the script
> > practically shows the 60% threshold would work for these cases.
>
> Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.

Agreed.

Since it seems you're working on another cleanup, I can address the above comments after your work is completed. But I'm also fine with including them into your cleanup work.

Regards,

[1] https://cirrus-ci.com/task/5078505327689728

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> Thanks! cfbot complains about some warnings but these are expected
> (due to unused delete routines etc). But one reported error[1] might
> be relevant to the 0002 patch?
> [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> external symbol pg_popcount64
> [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
> ---
> +#ifndef RT_COMMON
> +#define RT_COMMON
>
> What are we using this macro RT_COMMON for?
It was a quick way to define some things only once, so they probably all showed up in the list of things you found not undefined. It's different from the style of simplehash.h, which is to have a local name and #undef for every single thing. simplehash.h is a precedent, so I'll change it to match. I'll take a look at your list, too.
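To spell out the simplehash.h convention being referenced: the template header ends by #undef-ing everything it defined, so it can be included again cleanly with different parameters. For radixtree.h that cleanup block would presumably look roughly like the abbreviated sketch below, using names taken from the list above (the exact final set is whatever the cleanup patch settles on):

/* undefine internal macros so the template header can be included again */
#undef RT_SEARCH
#undef RT_NODE_SPAN
#undef RT_NODE_MAX_SLOTS
#undef RT_CHUNK_MASK
#undef RT_MAX_SHIFT
#undef RT_MAX_LEVEL
#undef RT_GET_KEY_CHUNK
#undef RT_NODE_KIND_4
#undef RT_NODE_KIND_32
#undef RT_NODE_KIND_125
#undef RT_NODE_KIND_256
#undef RT_NODE_KIND_COUNT
#undef RT_PTR_LOCAL
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef NODE_SLAB_BLOCK_SIZE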
> > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?
> Since it seems you're working on another cleanup, I can address the
> above comments after your work is completed. But I'm also fine with
> including them into your cleanup work.
I think we can work mostly simultaneously, if you work on tid store and vacuum, and I work on the template. We can always submit a full patchset including each other's latest work. That will catch rebase issues sooner.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
> > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > external symbol pg_popcount64
> > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > fatal error LNK1120: 1 unresolved externals
>
> Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI, so elsewhere the bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
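To illustrate the workaround, here is a minimal sketch of the kind of change meant here: counting bits through the byte-oriented pg_popcount() so the test module never references the pg_popcount64 symbol that fails to resolve on Windows. The count_isset_bits() helper and the uint64 bitmap are assumptions for the example, not the actual patch:

#include "postgres.h"

#include "port/pg_bitutils.h"

/*
 * Hypothetical helper: count the set bits of a small bitmap via the
 * general-purpose pg_popcount(), which works on a byte buffer, instead of
 * calling pg_popcount64() on each word.
 */
static inline int
count_isset_bits(const uint64 *bitmap, int nwords)
{
	return (int) pg_popcount((const char *) bitmap, nwords * (int) sizeof(uint64));
}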
> + /* XXX: do we need to set a callback on exit to detach dsa? */
>
> In the current shared radix tree design, it is the caller's responsibility
> to create (or attach to) a DSA area and pass it to RT_CREATE() or
> RT_ATTACH(). It enables us to use one DSA area not only for the radix
> tree but also for other data, which is more flexible. So the caller needs
> to detach from the DSA somehow, so I think we don't need to set a
> callback here for that.
>
> ---
> + dsa_free(tree->dsa, tree->ctl->handle); // XXX
> + //dsa_detach(tree->dsa);
>
> Similar to above, I think we should not detach from the DSA area here.
>
> Given that the DSA area used by the radix tree could be used also by
> other data, I think that in RT_FREE() we need to free each radix tree
> node allocated in DSA. In lazy vacuum, we check the memory usage
> instead of the number of TIDs and need to reset the TidStore after an
> index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
> to the OS. I've implemented rt_free_recurse() for this purpose in the
> v15 version patch.
>
> --
> - Assert(tree->root);
> + //Assert(tree->ctl->root);
>
> I think we don't need this assertion in the first place. We check it
> at the beginning of the function.
I've removed these in v19-0006.
> > That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.
>
> Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
I've moved it to the test module, which uses it extensively. There, the name is more clear about what it's for, so I didn't change the name.
> > It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
>
> Agreed.
Done in v19-0007.
v19-0009 is just a rebase over some more vacuum cleanups.
I'll continue working on internals cleanup.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v19-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v19-0005-Remove-RT_NUM_ENTRIES.patch
- v19-0004-Workaround-link-errors-on-Windows-CI.patch
- v19-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v19-0003-Add-radixtree-template.patch
- v19-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v19-0006-Shared-memory-cleanups.patch
- v19-0007-Make-RT_DELETE-optional.patch
- v19-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?
Now that I've studied it some more, I see what's happening: We need all bits set in the "value" before we insert it, since it would be too expensive to retrieve the current value, add one bit, and put it back. Also, as a consequence of the encoding, part of the tid is in the key, and part in the value. It makes more sense now, but it needs more than zero comments.
As for the order, I don't think it's the responsibility of the caller to guess if it needs sorting -- if unordered offsets lead to data loss, this function needs to take care of it.
> > + uint64 last_key = PG_UINT64_MAX;
> >
> > I'm having some difficulty understanding this sentinel and how it's used.
>
> Will improve the logic.
Part of the problem is the English language: "last" can mean "previous" or "at the end", so maybe some name changes would help.
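For what it's worth, here is a minimal sketch of how the add path reads from the discussion above: part of the TID goes into the radix-tree key and part into a bitmap value, all offsets that map to the same key are OR'ed together before a single insert, and a "previous key" sentinel decides when to flush. The constants, the rt_radix_tree type, and rt_set() are placeholders for illustration, not the patch's actual names:

#include "postgres.h"

#include "storage/block.h"
#include "storage/off.h"

/* placeholder constants: 11 offset bits, 64-bit (2^6 bits) bitmap values */
#define SK_OFFSET_BITS	11
#define SK_VALUE_BITS	6

/* placeholder radix tree API for the sketch */
typedef struct rt_radix_tree rt_radix_tree;
extern void rt_set(rt_radix_tree *tree, uint64 key, uint64 value);

static void
sketch_add_tids(rt_radix_tree *tree, BlockNumber blkno,
				const OffsetNumber *offsets, int noffsets)
{
	uint64		prev_key = PG_UINT64_MAX;	/* sentinel: nothing pending yet */
	uint64		value = 0;

	/* assumes offsets[] is sorted, so all bits for one key arrive adjacently */
	for (int i = 0; i < noffsets; i++)
	{
		uint64		tid_i = ((uint64) blkno << SK_OFFSET_BITS) | offsets[i];
		uint64		key = tid_i >> SK_VALUE_BITS;

		if (key != prev_key)
		{
			if (prev_key != PG_UINT64_MAX)
				rt_set(tree, prev_key, value);	/* one insert per key */
			prev_key = key;
			value = 0;
		}
		value |= UINT64CONST(1) << (tid_i & ((UINT64CONST(1) << SK_VALUE_BITS) - 1));
	}

	if (prev_key != PG_UINT64_MAX)
		rt_set(tree, prev_key, value);
}

With unsorted offsets, two offsets mapping to the same key could arrive non-adjacently, and the second rt_set() would clobber the bitmap written by the first; that is the data-loss hazard described above.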
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 17, 2023 at 8:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
>
> > > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > > external symbol pg_popcount64
> > > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > > fatal error LNK1120: 1 unresolved externals
> >
> > Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
>
> I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI, so elsewhere the bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.

I spent today investigating this issue and found out that on Windows, libpgport_src.a is not linked when building code outside of src/backend unless it is linked explicitly. It's not a problem on Linux etc., but the linker raises a fatal error on Windows. I'm not sure of the right way to fix it, but the attached patch resolved the issue on cfbot. It seems not to be related to the 0002 patch, but may be the designed behavior or a problem in meson. We can discuss it on a separate thread.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Jan 17, 2023 at 8:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
>
> > > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> > > external symbol pg_popcount64
> > > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> > > fatal error LNK1120: 1 unresolved externals
> >
> > Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
>
> I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI, so elsewhere the bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
>
> > + /* XXX: do we need to set a callback on exit to detach dsa? */
> >
> > In the current shared radix tree design, it is the caller's responsibility to create (or attach to) a DSA area and pass it to RT_CREATE() or RT_ATTACH(). It enables us to use one DSA area not only for the radix tree but also for other data, which is more flexible. So the caller needs to detach from the DSA somehow, so I think we don't need to set a callback here for that.
> >
> > ---
> > + dsa_free(tree->dsa, tree->ctl->handle); // XXX
> > + //dsa_detach(tree->dsa);
> >
> > Similar to above, I think we should not detach from the DSA area here.
> >
> > Given that the DSA area used by the radix tree could be used also by other data, I think that in RT_FREE() we need to free each radix tree node allocated in DSA. In lazy vacuum, we check the memory usage instead of the number of TIDs and need to reset the TidStore after an index scan. So it does RT_FREE() and dsa_trim() to return DSM segments to the OS. I've implemented rt_free_recurse() for this purpose in the v15 version patch.
> >
> > --
> > - Assert(tree->root);
> > + //Assert(tree->ctl->root);
> >
> > I think we don't need this assertion in the first place. We check it at the beginning of the function.
>
> I've removed these in v19-0006.
>
> > > That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.
> >
> > Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
>
> I've moved it to the test module, which uses it extensively. There, the name is more clear about what it's for, so I didn't change the name.
>
> > > It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
> >
> > Agreed.
>
> Done in v19-0007.
>
> v19-0009 is just a rebase over some more vacuum cleanups.

Thank you for updating the patches!

I've attached new version patches. There is no change from v19 for patches 0001 through 0006. The 0004, 0005 and 0006 patches look good to me; we can merge them into the 0003 patch.

The 0007 patch fixes functions that are defined when RT_DEBUG. These functions might be removed before the commit but this is useful at least under development.

The 0008 patch fixes a bug in RT_CHUNK_VALUES_ARRAY_SHIFT() and adds tests for that.

The 0009 patch fixes the cfbot issue by linking pgport_srv.

The 0010 patch adds RT_FREE_RECURSE() to free all radix tree nodes allocated in DSA.

The 0011 patch updates copyright and identification.

The 0012 and 0013 patches are updated patches that incorporate all comments I got so far.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v20-0013-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v20-0010-Free-all-radix-tree-node-recursively.patch
- v20-0011-Update-Copyright-and-Identification.patch
- v20-0009-add-link-to-pgport_srv-in-test_radixtree.patch
- v20-0012-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v20-0005-Shared-memory-cleanups.patch
- v20-0006-Make-RT_DELETE-optional.patch
- v20-0008-Fix-bug-in-RT_CHUNK_VALUES_ARRAY_SHIFT.patch
- v20-0007-Fix-RT_DEBUG-functions.patch
- v20-0004-Remove-RT_NUM_ENTRIES.patch
- v20-0003-Add-radixtree-template.patch
- v20-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v20-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
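For context, the recursive freeing in question is roughly of this shape: walk the tree from the root and dsa_free() every node, since detaching alone would leak the nodes into a DSA area that other data may still be using. This is only a schematic sketch; the node type and child accessors are placeholders, not the template's real ones:

#include "postgres.h"

#include "utils/dsa.h"

/* placeholder node type and accessors for the sketch */
typedef struct sk_node sk_node;
extern bool sk_is_leaf(sk_node *node);
extern int	sk_num_children(sk_node *node);
extern dsa_pointer sk_child(sk_node *node, int i);

static void
sketch_free_recurse(dsa_area *area, dsa_pointer nodep)
{
	sk_node    *node = (sk_node *) dsa_get_address(area, nodep);

	/* free the children first, then the node itself */
	if (!sk_is_leaf(node))
	{
		for (int i = 0; i < sk_num_children(node); i++)
			sketch_free_recurse(area, sk_child(node, i));
	}

	dsa_free(area, nodep);
}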
> + * XXX: Most functions in this file have two variants for inner nodes and leaf
> + * nodes, therefore there are duplication codes. While this sometimes makes the
> + * code maintenance tricky, this reduces branch prediction misses when judging
> + * whether the node is a inner node of a leaf node.
>
> This comment seems to be out-of-date since we made it a template.
Done in 0020, along with a bunch of other comment editing.
> The following macros are defined but not undefined in radixtree.h:
Fixed in v21-0018.
Also:
0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).
The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.
> > uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.
>
> Agreed, will fix.
Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as is the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )
> > + * We calculate the maximum bytes for the TidStore in different ways
> > + * for non-shared case and shared case. Please refer to the comment
> > + * TIDSTORE_MEMORY_DEDUCT for details.
> > + */
> >
> > Maybe the #define and comment should be close to here.
>
> Will fix.
For this, I intended that "here" meant "in or just above the function".
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.
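As a sketch of that suggestion, the computation could read something like the function below, with the explanation sitting next to the numbers. The 70kB deduction and the 75%/60% ratios come from the quoted #defines; everything else is illustrative:

#include "postgres.h"

static size_t
sketch_max_bytes(size_t max_bytes, bool shared)
{
	if (shared)
	{
		/*
		 * DSA grows its segments in powers of two, so an allocation can
		 * overshoot the requested limit; cap the usable fraction at 75% when
		 * max_bytes is a power of two and at 60% otherwise (per the
		 * simulation discussed upthread).
		 */
		if ((max_bytes & (max_bytes - 1)) == 0)
			return (size_t) (max_bytes * 0.75);

		return (size_t) (max_bytes * 0.6);
	}

	/* local case: deduct a fixed 70kB (the LOCAL_MAX_MEMORY_DEDUCT value) */
	return max_bytes - 70 * 1024;
}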
> > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > + * no other backend will attempt to access the TidStore before calling this
> > + * function. Other backend must explicitly call tidstore_detach to free up
> > + * backend-local memory associated with the TidStore. The backend that calls
> > + * tidstore_destroy must not call tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
>
> Will fix.
Did anything change here? There is also this, in the template, which I'm not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
> > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
>
> I agree that we don't need complexity here. I'll try this idea.
Keeping the offsets array in the prunestate seems to work out well.
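As a sketch of what that amounts to (only lpdead_items and deadoffsets are grounded in the patch hunks quoted in this thread; the struct name and other fields are illustrative):

#include "postgres.h"

#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "storage/off.h"

/*
 * Sketch of a prune state that carries the page's dead offsets back to the
 * caller, which then either adds them to the dead-items store or vacuums the
 * page directly when the table has no indexes.
 */
typedef struct SketchPagePruneState
{
	bool		has_lpdead_items;

	/* collected LP_DEAD offsets, including pre-existing LP_DEAD items */
	int			lpdead_items;
	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
} SketchPagePruneState;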
Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
TID store:
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
+typedef dsa_pointer tidstore_handle;
It's not clear why we need a typedef here, since here:
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;
...there is a differently-named dsa_pointer variable that just gets the function parameter.
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
size_t is more suitable for memory.
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?
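Assuming the answer is that it is not safe (as conceded downthread), the fix presumably looks roughly like the sketch below: take the store's lock before the insert and hold it across the statistics update. The control-struct fields and shared_rt_set() are assumptions for illustration:

#include "postgres.h"

#include "storage/lwlock.h"

/* placeholder types and API for the sketch */
typedef struct SketchTidStoreControl
{
	LWLock		lock;
	int64		num_tids;
} SketchTidStoreControl;

typedef struct SketchSharedRadixTree SketchSharedRadixTree;
extern void shared_rt_set(SketchSharedRadixTree *tree, uint64 key, uint64 value);

static void
sketch_locked_insert(SketchTidStoreControl *control, SketchSharedRadixTree *tree,
					 uint64 key, uint64 value, int ntids)
{
	/* take the exclusive lock before touching the shared tree or the stats */
	LWLockAcquire(&control->lock, LW_EXCLUSIVE);
	shared_rt_set(tree, key, value);
	control->num_tids += ntids;
	LWLockRelease(&control->lock);
}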
VACUUM integration:
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
That's all I have for now.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v21-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v21-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v21-0005-Restore-RT_GROW_NODE_KIND.patch
- v21-0004-Clean-up-some-nomenclature-around-node-insertion.patch
- v21-0003-Add-radixtree-template.patch
- v21-0006-Free-all-radix-tree-nodes-recursively.patch
- v21-0009-Remove-hard-coded-128.patch
- v21-0008-Streamline-calculation-of-slab-blocksize.patch
- v21-0010-Reduce-node4-to-node3.patch
- v21-0007-Make-value-type-configurable.patch
- v21-0012-Tool-for-measuring-radix-tree-performance.patch
- v21-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch
- v21-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch
- v21-0014-Add-some-comments-for-insert-logic.patch
- v21-0011-Expand-commentary-for-kinds-vs.-size-classes.patch
- v21-0019-Standardize-on-testing-for-is-leaf.patch
- v21-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch
- v21-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch
- v21-0018-Clean-up-symbols.patch
- v21-0020-Do-some-rewriting-and-proofreading-of-comments.patch
- v21-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v21-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
Attached is a rebase to fix conflicts from recent commits.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v22-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v22-0004-Clean-up-some-nomenclature-around-node-insertion.patch
- v22-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v22-0005-Restore-RT_GROW_NODE_KIND.patch
- v22-0003-Add-radixtree-template.patch
- v22-0007-Make-value-type-configurable.patch
- v22-0006-Free-all-radix-tree-nodes-recursively.patch
- v22-0008-Streamline-calculation-of-slab-blocksize.patch
- v22-0009-Remove-hard-coded-128.patch
- v22-0010-Reduce-node4-to-node3.patch
- v22-0011-Expand-commentary-for-kinds-vs.-size-classes.patch
- v22-0012-Tool-for-measuring-radix-tree-performance.patch
- v22-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch
- v22-0014-Add-some-comments-for-insert-logic.patch
- v22-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch
- v22-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch
- v22-0018-Clean-up-symbols.patch
- v22-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch
- v22-0019-Standardize-on-testing-for-is-leaf.patch
- v22-0020-Do-some-rewriting-and-proofreading-of-comments.patch
- v22-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v22-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
On Mon, Jan 23, 2023 at 6:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> Attached is a rebase to fix conflicts from recent commits.

I have reviewed the v22-0022* patch and I have some comments.

1.
>It also changes to the column names max_dead_tuples and num_dead_tuples and to
>show the progress information in bytes.

I think this statement needs to be rephrased.

2.
/*
 * vac_tid_reaped() -- is a particular tid deletable?
 *
 * This has the right signature to be an IndexBulkDeleteCallback.
 *
 * Assumes dead_items array is sorted (in ascending TID order).
 */

I think this comment 'Assumes dead_items array is sorted' is not valid anymore.

3.
We are changing the min value of 'maintenance_work_mem' to 2MB. Should we do the same for 'autovacuum_work_mem'?

4.
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];

We are actually collecting dead offsets, but the variable name says 'lpdead_items' instead of something like 'ndeadoffsets' or 'num_deadoffsets'. And the comment also says dead items.

5.
/*
 * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
 * vacrel->dead_items array.
 *
 * Caller must have an exclusive buffer lock on the buffer (though a full
 * cleanup lock is also acceptable). vmbuffer must be valid and already have
 * a pin on blkno's visibility map page.
 *
 * index is an offset into the vacrel->dead_items array for the first listed
 * LP_DEAD item on the page. The return value is the first index immediately
 * after all LP_DEAD items for the same page in the array.
 */

This comment needs to be changed as it is referring to the 'vacrel->dead_items array', which no longer exists.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 23, 2023 at 8:20 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.

Hmm, I don't remember I proposed such a patch, either.

One idea to address it would be that we pass a shared memory to RT_CREATE() and we create a DSA area dedicated to the radix tree in place. We should return the created DSA area along with the radix tree so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(), and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA area. A downside of this idea would be that one DSA area only for a radix tree is always required.

Another idea would be that we allocate a big enough DSA area and quarry small memory for nodes from there. But it would need to introduce another complexity so I prefer to avoid it.

FYI the current design is inspired by dshash.c. In dshash_destroy(), we dsa_free() each element allocated by dshash.c.

> > + * XXX: Most functions in this file have two variants for inner nodes and leaf
> > + * nodes, therefore there are duplication codes. While this sometimes makes the
> > + * code maintenance tricky, this reduces branch prediction misses when judging
> > + * whether the node is a inner node of a leaf node.
> >
> > This comment seems to be out-of-date since we made it a template.
>
> Done in 0020, along with a bunch of other comment editing.
>
> > The following macros are defined but not undefined in radixtree.h:
>
> Fixed in v21-0018.
>
> Also:
>
> 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.

radixtree_search_impl.h still assumes that the value type is an integer type as follows:

#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;

Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to pass the pointer of the value to RT_SET() instead of copying the values since the value size could be large.

> 0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
> 0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).
>
> The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.

0008 patch

 for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
  RT_SIZE_CLASS_INFO[i].name,
  RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);

There is an additional '%zu' at the end of the format string.

---
0011 patch

+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.

typo: s/statments/statements/

The rest look good to me. I'll incorporate these fixes in the next version patch.

> > > uint32 is how we store the block number, so this too small and will wrap around on overflow. int64 seems better.
> >
> > Agreed, will fix.
>
> Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as is the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )

Agreed. I'll change it in the next version patch.

> > > + * We calculate the maximum bytes for the TidStore in different ways
> > > + * for non-shared case and shared case. Please refer to the comment
> > > + * TIDSTORE_MEMORY_DEDUCT for details.
> > > + */
> > >
> > > Maybe the #define and comment should be close to here.
> >
> > Will fix.
>
> For this, I intended that "here" meant "in or just above the function".
>
> +#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
> +#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
>
> These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.

Agreed, will fix.

> > > + * Destroy a TidStore, returning all memory. The caller must be certain that
> > > + * no other backend will attempt to access the TidStore before calling this
> > > + * function. Other backend must explicitly call tidstore_detach to free up
> > > + * backend-local memory associated with the TidStore. The backend that calls
> > > + * tidstore_destroy must not call tidstore_detach.
> > > + */
> > > +void
> > > +tidstore_destroy(TidStore *ts)
> > >
> > > If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
> >
> > Will fix.
>
> Did anything change here?

Oops, the fix is missed in the patch for some reason. I'll fix it.

> There is also this, in the template, which I'm not sure has been addressed:
>
> * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
> * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> * We need either a safeguard to disallow other processes to begin the iteration
> * while one process is doing or to allow multiple processes to do the iteration.

It's not addressed yet. I think adding a safeguard is better for the first version. A simple solution is to add a flag, say iter_active, to allow only one process to enable the iteration. What do you think?

> > > This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
> >
> > I agree that we don't need complexity here. I'll try this idea.
>
> Keeping the offsets array in the prunestate seems to work out well.
>
> Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
>
> TID store:
>
> + * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
> + *
> + * X = bits used for offset number
> + * Y = bits used for block number
> + * u = unused bit
>
> I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).

I borrowed it from ginpostinglist.c but it seems better to write it in the common order.

> + * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
> + * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
>
> + * XXX: if we want to support non-heap table AM that want to use the full
> + * range of possible offset numbers, we'll need to reconsider
> + * TIDSTORE_OFFSET_NBITS value.
>
> Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.

I think we can pass the maximum offset numbers to tidstore_create() and calculate these values.

> +typedef dsa_pointer tidstore_handle;
>
> It's not clear why we need a typedef here, since here:
>
> +tidstore_attach(dsa_area *area, tidstore_handle handle)
> +{
> + TidStore *ts;
> + dsa_pointer control;
> ...
> + control = handle;
>
> ...there is a differently-named dsa_pointer variable that just gets the function parameter.

I guess one reason is to improve compatibility; we can stash the actual value of the handle, which could help in some cases, for example, when we need to change the actual value of the handle. dshash.c uses the same idea. Another reason would be to improve readability.

> +/* Return the maximum memory TidStore can use */
> +uint64
> +tidstore_max_memory(TidStore *ts)
>
> size_t is more suitable for memory.

Will fix.

> + /*
> + * Since the shared radix tree supports concurrent insert,
> + * we don't need to acquire the lock.
> + */
>
> Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?

You're right. I was missing something. The lock should be taken before adding key-value pairs.

> VACUUM integration:
>
> -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
> +#define PARALLEL_VACUUM_KEY_DSA 2
>
> Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.

Agreed, will remove.

> +#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
>
> This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:
>
> - if (lpdead_items > 0)
> + if (prunestate->lpdead_items > 0)

Will remove the macro.

> About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)

That seems a valid concern. I borrowed the "control object" from dshash.c, but it supports only shared cases. The fact that the radix tree supports both local and shared seems to introduce this confusion. I came up with other names such as RT_RADIX_TREE_CORE or RT_RADIX_TREE_ROOT but I'm not sure these are better than the current one.

> Now might be a good time to look at earlier XXX comments and come up with a plan to address them.

Agreed. Other XXX comments that are not mentioned yet are:

+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH() since in the shared case, we allocate backend-local memory only for RT_RADIX_TREE.

---
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix tree, RT_RADIX_TREE is very small so probably we can ignore it.

---
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex-encoded format, but given that the values could be large, we don't necessarily need to display the actual values. Displaying the tree structure and chunks would be helpful for debugging the radix tree.

---
There is no XXX comment, but I'll try to add lock support in the next version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 23, 2023 at 8:20 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
>
> Hmm, I don't remember I proposed such a patch, either.
I went looking, and it turns out I remembered wrong, sorry.
> One idea to address it would be that we pass a shared memory to
> RT_CREATE() and we create a DSA area dedicated to the radix tree in
> place. We should return the created DSA area along with the radix tree
> so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
> and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
> area. A downside of this idea would be that one DSA area only for a
> radix tree is always required.
>
> Another idea would be that we allocate a big enough DSA area and
> quarry small memory for nodes from there. But it would need to
> introduce another complexity so I prefer to avoid it.
>
> FYI the current design is inspired by dshash.c. In dshash_destory(),
> we dsa_free() each elements allocated by dshash.c
Okay, thanks for the info.
> > 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
>
> radixtree_search_impl.h still assumes that the value type is an
> integer type as follows:
>
> #ifdef RT_NODE_LEVEL_LEAF
> RT_VALUE_TYPE value = 0;
>
> Assert(RT_NODE_IS_LEAF(node));
> #else
>
> Also, I think if we make the value type configurable, it's better to
> pass the pointer of the value to RT_SET() instead of copying the
> values since the value size could be large.
Thanks, I will remove the assignment and look into pass-by-reference.
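To illustrate the pass-by-reference direction being discussed (every name here is hypothetical, not the template's actual output): with a configurable and possibly large RT_VALUE_TYPE, the setter would take a pointer so only a pointer crosses the call boundary.

    /* Illustrative only. */
    #include "postgres.h"
    #include "access/htup_details.h"   /* MaxHeapTuplesPerPage */
    #include "storage/off.h"

    struct sketch_radix_tree;          /* opaque, hypothetical tree type */

    typedef struct SketchBlockValue
    {
        int             noffsets;
        OffsetNumber    offsets[MaxHeapTuplesPerPage];  /* a "large" value type */
    } SketchBlockValue;

    /* hypothetical instantiation of the template's RT_SET() */
    extern bool sketch_rt_set(struct sketch_radix_tree *tree,
                              uint64 key, SketchBlockValue *value_p);

    static void
    sketch_store_block(struct sketch_radix_tree *tree, BlockNumber blkno,
                       SketchBlockValue *val)
    {
        /* only the pointer is passed; the tree copies the value internally */
        (void) sketch_rt_set(tree, (uint64) blkno, val);
    }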
> Oops, the fix is missed in the patch for some reason. I'll fix it.
>
> > There is also this, in the template, which I'm not sure has been addressed:
> >
> > * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
> > * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> > * We need either a safeguard to disallow other processes to begin the iteration
> > * while one process is doing or to allow multiple processes to do the iteration.
>
> It's not addressed yet. I think adding a safeguard is better for the
> first version. A simple solution is to add a flag, say iter_active, to
> allow only one process to enable the iteration. What do you think?
I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's come up before, but could you describe why iteration is different from other operations, regarding concurrency?
> > Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
>
> I think we can pass the maximum offset numbers to tidstore_create()
> and calculate these values.
That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that would be to pass along the right value.
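As a sketch of how a caller-supplied maximum offset could replace the hard-coded constant: the bit width could be computed once at tidstore_create() time. pg_leftmost_one_pos32() is the existing helper in port/pg_bitutils.h; the rest is hypothetical.

    /* Illustrative only. */
    #include "postgres.h"
    #include "port/pg_bitutils.h"
    #include "storage/off.h"

    static int
    sketch_offset_nbits(OffsetNumber max_off)
    {
        Assert(max_off > 0);

        /* smallest width that can represent offsets 1..max_off */
        return pg_leftmost_one_pos32((uint32) max_off) + 1;
    }

    /*
     * e.g. heap with 8kB blocks: max_off = MaxHeapTuplesPerPage = 291 -> 9 bits;
     * the quoted comment reserves 11 bits to cover all supported block sizes.
     */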
> > About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
>
> That seems a valid concern. I borrowed the "control object" from
> dshash.c but it supports only shared cases. The fact that the radix
> tree supports both local and shared seems to introduce this confusion.
> I came up with other names such as RT_RADIX_TREE_CORE or
> RT_RADIX_TREE_ROOT but not sure these are better than the current
> one.
Okay, if dshash uses it, we have some precedent.
> > Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
>
> Agreed.
>
> Other XXX comments that are not mentioned yet are:
>
> + /* XXX: memory context support */
> + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
>
> I'm not sure we really need memory context support for RT_ATTACH()
> since in the shared case, we allocate backend-local memory only for
> RT_RADIX_TREE.
Okay, we can remove this.
> ---
> +RT_SCOPE uint64
> +RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
> +{
> + // XXX is this necessary?
> + Size total = sizeof(RT_RADIX_TREE);
>
> Regarding this, I followed intset_memory_usage(). But in the radix
> tree, RT_RADIX_TREE is very small so probably we can ignore it.
That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite that initial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to be fixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very precise (nor needs to be), I agree we should just forget about tiny sizes like this in both cases.
> ---
> +/* XXX For display, assumes value type is numeric */
> +static void
> +RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
>
> I think we can display values in hex encoded format but given the
> value could be large, we don't necessarily need to display actual
> values. Displaying the tree structure and chunks would be helpful for
> debugging the radix tree.
Okay, I can try that unless you do it first.
> There is no XXX comment but I'll try to add lock support in the next
> version patch.
Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 23, 2023 at 6:00 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > Attached is a rebase to fix conflicts from recent commits.
>
> I have reviewed v22-0022* patch and I have some comments.
>
> 1.
> >It also changes to the column names max_dead_tuples and num_dead_tuples and to
> >show the progress information in bytes.
>
> I think this statement needs to be rephrased.
Could you be more specific?
> 3.
>
> We are changing the min value of 'maintenance_work_mem' to 2MB. Should
> we do the same for the 'autovacuum_work_mem'?
Yes, we should change that, too. We've discussed previously that autovacuum_work_mem is possibly rendered unnecessary by this work, but we agreed that that should be a separate thread, and it needs additional testing to verify.
I agree with your other comments.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 26, 2023 at 3:54 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jan 23, 2023 at 8:20 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor > > > > <john.naylor@enterprisedb.com> wrote: > > > > > > In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception:v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believeone of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with thatapproach? I don't recall where that discussion went. > > > > Hmm, I don't remember I proposed such a patch, either. > > I went looking, and it turns out I remembered wrong, sorry. > > > One idea to address it would be that we pass a shared memory to > > RT_CREATE() and we create a DSA area dedicated to the radix tree in > > place. We should return the created DSA area along with the radix tree > > so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(), > > and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA > > area. A downside of this idea would be that one DSA area only for a > > radix tree is always required. > > > > Another idea would be that we allocate a big enough DSA area and > > quarry small memory for nodes from there. But it would need to > > introduce another complexity so I prefer to avoid it. > > > > FYI the current design is inspired by dshash.c. In dshash_destory(), > > we dsa_free() each elements allocated by dshash.c > > Okay, thanks for the info. > > > > 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest isagnostic. > > > > radixtree_search_impl.h still assumes that the value type is an > > integer type as follows: > > > > #ifdef RT_NODE_LEVEL_LEAF > > RT_VALUE_TYPE value = 0; > > > > Assert(RT_NODE_IS_LEAF(node)); > > #else > > > > Also, I think if we make the value type configurable, it's better to > > pass the pointer of the value to RT_SET() instead of copying the > > values since the value size could be large. > > Thanks, I will remove the assignment and look into pass-by-reference. > > > Oops, the fix is missed in the patch for some reason. I'll fix it. > > > > > There is also this, in the template, which I'm not sure has been addressed: > > > > > > * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter > > > * has the local pointers to nodes, rather than RT_PTR_ALLOC. > > > * We need either a safeguard to disallow other processes to begin the iteration > > > * while one process is doing or to allow multiple processes to do the iteration. > > > > It's not addressed yet. I think adding a safeguard is better for the > > first version. A simple solution is to add a flag, say iter_active, to > > allow only one process to enable the iteration. What do you think? > > I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's comeup before, but could you describe why iteration is different from other operations, regarding concurrency? 
I think that we need to prevent concurrent updates (RT_SET() and RT_DELETE()) during the iteration to get the consistent result through the whole iteration operation. Unlike other operations such as RT_SET(), we cannot expect that a job doing something for each key-value pair in the radix tree completes in a short time, so we cannot keep holding the radix tree lock until the end of the iteration. So the idea is that we set iter_active to true (with the lock in exclusive mode), and prevent concurrent updates when the flag is true. > > > > Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallbackfor other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine. > > > > I think we can pass the maximum offset numbers to tidstore_create() > > and calculate these values. > > That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven'tlooked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that wouldbe to pass along the right value. I think the user (e.g, vacuumlazy.c) can pass the maximum offset number to the parallel vacuum. > > > > About shared memory: I have some mild reservations about the naming of the "control object", which may be in sharedmemory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memoryis the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree.I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment) > > > > That seems a valid concern. I borrowed the "control object" from > > dshash.c but it supports only shared cases. The fact that the radix > > tree supports both local and shared seems to introduce this confusion. > > I came up with other names such as RT_RADIX_TREE_CORE or > > RT_RADIX_TREE_ROOT but not sure these are better than the current > > one. > > Okay, if dshash uses it, we have some precedent. > > > > Now might be a good time to look at earlier XXX comments and come up with a plan to address them. > > > > Agreed. > > > > Other XXX comments that are not mentioned yet are: > > > > + /* XXX: memory context support */ > > + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); > > > > I'm not sure we really need memory context support for RT_ATTACH() > > since in the shared case, we allocate backend-local memory only for > > RT_RADIX_TREE. > > Okay, we can remove this. > > > --- > > +RT_SCOPE uint64 > > +RT_MEMORY_USAGE(RT_RADIX_TREE *tree) > > +{ > > + // XXX is this necessary? > > + Size total = sizeof(RT_RADIX_TREE); > > > > Regarding this, I followed intset_memory_usage(). But in the radix > > tree, RT_RADIX_TREE is very small so probably we can ignore it. > > That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite thatinitial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to befixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very precise(nor needs to be), I agree we should just forget about tiny sizes like this in both cases. Thanks for your explanation, agreed. 
> > > --- > > +/* XXX For display, assumes value type is numeric */ > > +static void > > +RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse) > > > > I think we can display values in hex encoded format but given the > > value could be large, we don't necessarily need to display actual > > values. Displaying the tree structure and chunks would be helpful for > > debugging the radix tree. > > Okay, I can try that unless you do it first. > > > There is no XXX comment but I'll try to add lock support in the next > > version patch. > > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration. The lock I'm thinking of adding is a simple readers-writer lock. This lock is used for concurrent radix tree operations except for the iteration. For operations concurrent to the iteration, I used a flag for the reason I mentioned above. > > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves. Using the isolation tester to test locking seems like a good idea. We can include it in test_radixtree. But given that the locking in the radix tree is very simple, the test case would be very simple. It may be controversial whether it's worth adding such testing by adding both the new test module and test cases. I'm working on the fixes I mentioned in the previous email and going to share the updated patch today. Please wait to do these fixes if you're okay. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
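A minimal, illustrative-only sketch of the safeguard described above (the struct and function names are hypothetical; in the patch this logic would live inside the RT_* template functions):

    #include "postgres.h"
    #include "storage/lwlock.h"

    typedef struct SketchTreeControl
    {
        LWLock      lock;           /* simple reader-writer lock */
        bool        iter_active;    /* true while one backend is iterating */
    } SketchTreeControl;

    static void
    sketch_begin_iterate(SketchTreeControl *ctl)
    {
        LWLockAcquire(&ctl->lock, LW_EXCLUSIVE);
        if (ctl->iter_active)
        {
            LWLockRelease(&ctl->lock);
            elog(ERROR, "concurrent iteration is not supported");
        }
        ctl->iter_active = true;
        LWLockRelease(&ctl->lock);
    }

    static void
    sketch_end_iterate(SketchTreeControl *ctl)
    {
        LWLockAcquire(&ctl->lock, LW_EXCLUSIVE);
        ctl->iter_active = false;
        LWLockRelease(&ctl->lock);
    }

    static void
    sketch_insert(SketchTreeControl *ctl)
    {
        LWLockAcquire(&ctl->lock, LW_EXCLUSIVE);
        if (ctl->iter_active)
        {
            LWLockRelease(&ctl->lock);
            elog(ERROR, "cannot modify the tree while an iteration is in progress");
        }
        /* ... perform the modification while holding the lock ... */
        LWLockRelease(&ctl->lock);
    }

Lookups would take the lock in LW_SHARED mode; the lock itself is assumed to have been set up with LWLockInitialize() when the control struct was created.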
On Thu, Jan 26, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I'm working on the fixes I mentioned in the previous email and going > to share the updated patch today. Please wait to do these fixes if > you're okay. > I've attached updated version patches. As we agreed I've merged your changes in v22 into the main (0003) patch. But I still kept the patch of recursively freeing nodes separate as we might need more discussion. In v23 I attached, 0006 through 0016 patches are fixes and improvements for the radix tree. I've incorporated all comments I got unless I'm missing something. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v23-0016-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
- v23-0014-Improve-RT_DUMP-and-RT_DUMP_SEARCH-output.patch
- v23-0018-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v23-0017-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v23-0015-Detach-DSA-after-tests-in-test_radixtree.patch
- v23-0013-Remove-XXX-comment-for-MemoryContext-support-for.patch
- v23-0011-Add-a-safeguard-for-concurrent-iteration-in-RT_S.patch
- v23-0010-Fix-a-typo-in-simd.h.patch
- v23-0009-Miscellaneous-fixes.patch
- v23-0012-Don-t-include-the-size-of-RT_RADIX_TREE-to-memor.patch
- v23-0008-Align-indents-of-the-file-header-comments.patch
- v23-0007-undef-RT_SLOT_IDX_LIMIT.patch
- v23-0006-Fix-compile-error-when-RT_VALUE_TYPE-is-non-inte.patch
- v23-0005-Tool-for-measuring-radix-tree-performance.patch
- v23-0004-Free-all-radix-tree-nodes-recursively.patch
- v23-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v23-0003-Add-radixtree-template.patch
- v23-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> I think that we need to prevent concurrent updates (RT_SET() and
> RT_DELETE()) during the iteration to get the consistent result through
> the whole iteration operation. Unlike other operations such as
> RT_SET(), we cannot expect that a job doing something for each
> key-value pair in the radix tree completes in a short time, so we
> cannot keep holding the radix tree lock until the end of the
> iteration.
This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
> So the idea is that we set iter_active to true (with the
> lock in exclusive mode), and prevent concurrent updates when the flag
> is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
> > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
>
> The lock I'm thinking of adding is a simple readers-writer lock. This
> lock is used for concurrent radix tree operations except for the
> iteration. For operations concurrent to the iteration, I used a flag
> for the reason I mentioned above.
This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.
+ * WIP: describe about how locking works.
A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
> > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
>
> Using the isolation tester to test locking seems like a good idea. We
> can include it in test_radixtree. But given that the locking in the
> radix tree is very simple, the test case would be very simple. It may
> be controversial whether it's worth adding such testing by adding both
> the new test module and test cases.
I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.
> I think the user (e.g, vacuumlazy.c) can pass the maximum offset
> number to the parallel vacuum.
Okay, sounds good.
Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:
0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12. And I will work later on making "value" in the public API a pointer.
0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);
There is repetition in the else branch.
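Presumably the intent is to count sizeof(TidStore) once in both branches, i.e.:

    if (TidStoreIsShared(ts))
        return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
    else
        return sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);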
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Sat, Jan 28, 2023 at 8:33 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jan 26, 2023 at 3:54 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > I think that we need to prevent concurrent updates (RT_SET() and > > RT_DELETE()) during the iteration to get the consistent result through > > the whole iteration operation. Unlike other operations such as > > RT_SET(), we cannot expect that a job doing something for each > > key-value pair in the radix tree completes in a short time, so we > > cannot keep holding the radix tree lock until the end of the > > iteration. > > This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we shouldworry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-writeworkloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme,possibly using one of the ideas in the last ART paper you mentioned. > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possibleanyway. Yes, but if a concurrent writer waits for another process to finish the iteration, it ends up waiting on a lwlock, which is not interruptible. > > > So the idea is that we set iter_active to true (with the > > lock in exclusive mode), and prevent concurrent updates when the flag > > is true. > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting. Right. I think if we want to wait rather than an ERROR, the waiter should wait in an interruptible way, for example, a condition variable. I did a simpler way in the v22 patch. ...but looking at dshash.c, dshash_seq_next() seems to return an entry while holding a lwlock on the partition. My assumption might be wrong. > > > > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the nextpatch, the email should contain a few sentences describing how locking is intended to work, including for iteration. > > > > The lock I'm thinking of adding is a simple readers-writer lock. This > > lock is used for concurrent radix tree operations except for the > > iteration. For operations concurrent to the iteration, I used a flag > > for the reason I mentioned above. > > This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only havea vague idea about the tradeoffs made regarding iteration. > > + * WIP: describe about how locking works. > > A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no commentsand a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until Ihave a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested. > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameterfor whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriatelock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note,I'm just guessing here, and I don't want to make things more difficult for future improvements. Seems a good idea. 
Given the use case for parallel heap vacuum, it would be a good idea to support having multiple read-only writers. The iteration of the v22 is read-only, so if we want to support read-write iterator, we would need to support a function that modifies the current key-value returned by the iteration. > > > > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves. > > > > Using the isolation tester to test locking seems like a good idea. We > > can include it in test_radixtree. But given that the locking in the > > radix tree is very simple, the test case would be very simple. It may > > be controversial whether it's worth adding such testing by adding both > > the new test module and test cases. > > I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing. Okay, understood. > > > I think the user (e.g, vacuumlazy.c) can pass the maximum offset > > number to the parallel vacuum. > > Okay, sounds good. > > Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception: > > 0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12. Agreed with the attached patch. > And I will work later on making "value" in the public API a pointer. Thanks! > > 0017 - I haven't taken a close look at the new changes, but I did notice this some time ago: > > + if (TidStoreIsShared(ts)) > + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared); > + else > + return sizeof(TidStore) + sizeof(TidStore) + > + local_rt_memory_usage(ts->tree.local); > > There is repetition in the else branch. Agreed, will remove. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
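As an illustration of the interruptible-wait alternative mentioned above (a condition variable instead of erroring out or blocking uninterruptibly on an LWLock), assuming the same kind of hypothetical control struct sketched earlier, extended with a ConditionVariable:

    /* Illustrative only. */
    #include "postgres.h"
    #include "storage/condition_variable.h"
    #include "storage/lwlock.h"
    #include "utils/wait_event.h"

    typedef struct SketchTreeControl
    {
        LWLock              lock;
        bool                iter_active;
        ConditionVariable   iter_done_cv;   /* broadcast when iteration ends */
    } SketchTreeControl;

    static void
    sketch_wait_for_iteration(SketchTreeControl *ctl)
    {
        ConditionVariablePrepareToSleep(&ctl->iter_done_cv);
        for (;;)
        {
            bool    active;

            LWLockAcquire(&ctl->lock, LW_SHARED);
            active = ctl->iter_active;
            LWLockRelease(&ctl->lock);

            if (!active)
                break;

            /* interruptible wait; PG_WAIT_EXTENSION is only a placeholder wait event */
            ConditionVariableSleep(&ctl->iter_done_cv, PG_WAIT_EXTENSION);
        }
        ConditionVariableCancelSleep();
    }

The iteration-ending path would clear iter_active under the lock and then call ConditionVariableBroadcast(&ctl->iter_done_cv).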
On Thu, Jan 26, 2023 at 12:39 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jan 23, 2023 at 6:00 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > Attached is a rebase to fix conflicts from recent commits. > > > > I have reviewed v22-0022* patch and I have some comments. > > > > 1. > > >It also changes to the column names max_dead_tuples and num_dead_tuples and to > > >show the progress information in bytes. > > > > I think this statement needs to be rephrased. > > Could you be more specific? I mean the below statement in the commit message doesn't look grammatically correct to me. "It also changes to the column names max_dead_tuples and num_dead_tuples and to show the progress information in bytes." -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 8:33 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
>
> Yes, but if a concurrent writer waits for another process to finish
> the iteration, it ends up waiting on a lwlock, which is not
> interruptible.
>
> >
> > > So the idea is that we set iter_active to true (with the
> > > lock in exclusive mode), and prevent concurrent updates when the flag
> > > is true.
> >
> > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
>
> Right. I think if we want to wait rather than an ERROR, the waiter
> should wait in an interruptible way, for example, a condition
> variable. I did a simpler way in the v22 patch.
>
> ...but looking at dshash.c, dshash_seq_next() seems to return an entry
> while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
> > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
>
> Seems a good idea. Given the use case for parallel heap vacuum, it
> would be a good idea to support having multiple read-only writers. The
> iteration of the v22 is read-only, so if we want to support read-write
> iterator, we would need to support a function that modifies the
> current key-value returned by the iteration.
Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 30, 2023 at 1:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jan 26, 2023 at 12:39 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > > > On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jan 23, 2023 at 6:00 PM John Naylor > > > <john.naylor@enterprisedb.com> wrote: > > > > > > > > Attached is a rebase to fix conflicts from recent commits. > > > > > > I have reviewed v22-0022* patch and I have some comments. > > > > > > 1. > > > >It also changes to the column names max_dead_tuples and num_dead_tuples and to > > > >show the progress information in bytes. > > > > > > I think this statement needs to be rephrased. > > > > Could you be more specific? > > I mean the below statement in the commit message doesn't look > grammatically correct to me. > > "It also changes to the column names max_dead_tuples and > num_dead_tuples and to show the progress information in bytes." > I've changed the commit message in the v23 patch. Please check it. Other comments are also incorporated in the v23 patch. Thank you for the comments! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 1:31 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Jan 28, 2023 at 8:33 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much aspossible anyway. > > > > Yes, but if a concurrent writer waits for another process to finish > > the iteration, it ends up waiting on a lwlock, which is not > > interruptible. > > > > > > > > > So the idea is that we set iter_active to true (with the > > > > lock in exclusive mode), and prevent concurrent updates when the flag > > > > is true. > > > > > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting. > > > > Right. I think if we want to wait rather than an ERROR, the waiter > > should wait in an interruptible way, for example, a condition > > variable. I did a simpler way in the v22 patch. > > > > ...but looking at dshash.c, dshash_seq_next() seems to return an entry > > while holding a lwlock on the partition. My assumption might be wrong. > > Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there. > > If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads,I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should justgo all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level commentand include a link to the second ART paper. Agreed. Will update the comments. > > > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameterfor whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriatelock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note,I'm just guessing here, and I don't want to make things more difficult for future improvements. > > > > Seems a good idea. Given the use case for parallel heap vacuum, it > > would be a good idea to support having multiple read-only writers. The > > iteration of the v22 is read-only, so if we want to support read-write > > iterator, we would need to support a function that modifies the > > current key-value returned by the iteration. > > Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also wait forfine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support: > > 1) parallel heap vacuum -> multiple read-only iterators > 2) parallel heap pruning -> multiple writers > > It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improvevacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work finefor #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could"pre-warm" the tid store with zero-values using block numbers from the visibility map. True. Using a larger batching method seems to be worth testing when we implement the parallel heap pruning. In the next version patch, I'm going to update the locking support part and incorporate other comments I got. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 30, 2023 at 1:31 PM John Naylor > <john.naylor@enterprisedb.com> wrote: > > > > On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sat, Jan 28, 2023 at 8:33 PM John Naylor > > > <john.naylor@enterprisedb.com> wrote: > > > > > > The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As muchas possible anyway. > > > > > > Yes, but if a concurrent writer waits for another process to finish > > > the iteration, it ends up waiting on a lwlock, which is not > > > interruptible. > > > > > > > > > > > > So the idea is that we set iter_active to true (with the > > > > > lock in exclusive mode), and prevent concurrent updates when the flag > > > > > is true. > > > > > > > > ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting. > > > > > > Right. I think if we want to wait rather than an ERROR, the waiter > > > should wait in an interruptible way, for example, a condition > > > variable. I did a simpler way in the v22 patch. > > > > > > ...but looking at dshash.c, dshash_seq_next() seems to return an entry > > > while holding a lwlock on the partition. My assumption might be wrong. > > > > Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there. > > > > If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-writeworkloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point weshould just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-levelcomment and include a link to the second ART paper. > > Agreed. Will update the comments. > > > > > > > [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameterfor whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriatelock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note,I'm just guessing here, and I don't want to make things more difficult for future improvements. > > > > > > Seems a good idea. Given the use case for parallel heap vacuum, it > > > would be a good idea to support having multiple read-only writers. The > > > iteration of the v22 is read-only, so if we want to support read-write > > > iterator, we would need to support a function that modifies the > > > current key-value returned by the iteration. > > > > Okay, so updating during iteration is not currently supported. It could in the future, but I'd say that can also waitfor fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support: > > > > 1) parallel heap vacuum -> multiple read-only iterators > > 2) parallel heap pruning -> multiple writers > > > > It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improvevacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work finefor #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could"pre-warm" the tid store with zero-values using block numbers from the visibility map. > > True. 
> Using a larger batching method seems to be worth testing when we implement the parallel heap pruning.
>
> In the next version patch, I'm going to update the locking support part and incorporate other comments I got.

I've attached v24 patches. The locking support patch is separated (0005 patch). Also I kept the updates for TidStore and the vacuum integration from v23 separate.

Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v24-0005-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
- v24-0008-Update-TidStore-patch-from-v23.patch
- v24-0009-Update-vacuum-integration-patch-from-v23.patch
- v24-0007-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v24-0006-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v24-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v24-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v24-0003-Add-radixtree-template.patch
- v24-0004-Tool-for-measuring-radix-tree-performance.patch
On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached v24 patches. The locking support patch is separated
> (0005 patch). Also I kept the updates for TidStore and the vacuum
> integration from v23 separate.
Okay, that's a lot simpler, and closer to what I imagined. For v25, I squashed v24's additions and added a couple of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.
I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- some good and bad news on that below.
0001:
I removed the uint64 case, as discussed. There is now a brief commit message, but it needs to be fleshed out a bit. I took another look at the Arm optimization that Nathan found some months ago, for forming the highbit mask, but that doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in case someone else gets a similar idea.
I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.", but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but I removed the reminder from the commit message.
0003:
The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so also not sent separately.
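For readers following along, here is a minimal sketch of how a caller might instantiate the template with the value now handed over by pointer. Only RT_PREFIX, RT_SHMEM, and RT_SCOPE appear elsewhere in this thread; the other macro names and the create/set/free signatures here are assumptions for illustration, not necessarily the patch's exact API.

/*
 * Sketch only: apart from RT_PREFIX/RT_SCOPE/RT_SHMEM, the macro names and
 * function signatures below are assumed for illustration.
 */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
radix_tree_example(void)
{
    local_rt_radix_tree *tree = local_rt_create(CurrentMemoryContext);
    uint64      val = 42;

    /* with 0003, the value is passed by pointer rather than by value */
    local_rt_set(tree, UINT64CONST(123), &val);

    local_rt_free(tree);
}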
0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate to make it easier to test other patch versions.
The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this in any other bench function, and it was the reason for writing 0005 to begin with. More confusingly, my efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:
0006 adds and removes some "inline" declarations (where it made sense), and adds some "pg_noinline" declarations based on Andres' advice from some months ago.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.
v25-addendum-try-no-maintain-order.txt -- It makes keeping the key chunks in order optional for the linear-search nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration, or at least make it optional as in the patch.
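To make the tradeoff concrete, here is an illustrative sketch (not the patch's actual code) of inserting a chunk into a linear-search node with and without maintaining order; the function and variable names are made up, and the usual PostgreSQL headers (c.h, string.h) are assumed.

/* Illustrative only: keeping chunks sorted requires shifting on insert. */
static void
insert_chunk_ordered(uint8 *chunks, int *count, uint8 new_chunk)
{
    int         insertpos = 0;

    while (insertpos < *count && chunks[insertpos] < new_chunk)
        insertpos++;

    /* shift the tail up by one byte to keep the array sorted */
    memmove(&chunks[insertpos + 1], &chunks[insertpos], *count - insertpos);
    chunks[insertpos] = new_chunk;
    (*count)++;
}

static void
insert_chunk_unordered(uint8 *chunks, int *count, uint8 new_chunk)
{
    /* no shifting: append and let lookups scan linearly */
    chunks[(*count)++] = new_chunk;
}

The memmove in the first variant is what the addendum avoids, at the cost of no longer iterating these nodes in key order.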
Now for some numbers:
========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)
v15:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
334182184 | 3352 | 2073
v25-0005:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3426 | 2126
v25-0006 (inlining or not):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3327 | 2035
v25-0007 (remove dead code):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3313 | 2037
v25-addendum...txt (no ordering):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 2762 | 2042
Allowing unordered inserts helps a lot here in loading. That's expected because there are a lot of inserts into the linear nodes. 0006 might help a little.
========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"
v15:
avg
----------------------
207.3000000000000000
v25-0005:
avg
----------------------
190.6000000000000000
v25-0006 (inlining or not):
avg
----------------------
189.3333333333333333
v25-0007 (remove dead code):
avg
----------------------
186.4666666666666667
v25-addendum...txt (no ordering):
avg
----------------------
179.7000000000000000
Most of the improvement from v15 to v25 probably comes from the change from node4 to node3, and this test stresses that node the most. That shows in the total memory used: it goes from 152MB to 132MB. Allowing unordered inserts helps some; the others are not convincing.
========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
(min load time of three)
v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455
v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456
v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455
v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455
v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455
Note: The regression seems to have started in v17, which is the first with a full template.
Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?
I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v25-addendum-try-no-maintain-order.txt
- v25-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v25-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v25-0005-Measure-load-time-of-bench_search_random_nodes.patch
- v25-0004-Tool-for-measuring-radix-tree-performance.patch
- v25-0003-Add-radixtree-template.patch
- v25-0006-Adjust-some-inlining-declarations.patch
- v25-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch
- v25-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v25-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
On Tue, Feb 7, 2023 at 4:25 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> [v25]
This conflicted with a commit from earlier today, so rebased in v26 with no further changes.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v25-addendum-try-no-maintain-order.txt
- v26-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v26-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v26-0005-Measure-load-time-of-bench_search_random_nodes.patch
- v26-0004-Tool-for-measuring-radix-tree-performance.patch
- v26-0003-Add-radixtree-template.patch
- v26-0006-Adjust-some-inlining-declarations.patch
- v26-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch
- v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v26-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
Hi, On Tue, Feb 7, 2023 at 6:25 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached v24 patches. The locking support patch is separated > > (0005 patch). Also I kept the updates for TidStore and the vacuum > > integration from v23 separate. > > Okay, that's a lot more simple, and closer to what I imagined. For v25, I squashed v24's additions and added a couple ofmy own. I've kept the CF status at "needs review" because no specific action is required at the moment. > > I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided tore-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- somegood and bad news on that below. > > 0001: > > I removed the uint64 case, as discussed. There is now a brief commit message, but needs to be fleshed out a bit. I tookanother look at the Arm optimization that Nathan found some month ago, for forming the highbit mask, but that doesn'tplay nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in casesomeone else gets a similar idea. > > I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.",but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but Iremoved the reminder from the commit message. The changes make sense to me. > > 0003: > > The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multipleother patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top commentto the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so alsonot sent separately. Great. > > 0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate tomake it easier to test other patch versions. > > The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproducethis in any other bench function and was the reason for writing 0005 to begin with. More confusingly, my effortsto fix this improved *other* functions, but the former didn't budge at all. First the patches: > > 0006 adds and removes some "inline" declarations (where it made sense), and added some for "pg_noinline" based on Andres'advice some months ago. Agreed. > > 0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existingkey. This kind of change is much easier with the inner/node cases handled together in a template, as far as beingsure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist,but thought yet another #ifdef would be too messy. Agreed. > > v25-addendum-try-no-maintain-order.txt -- It makes optional keeping the key chunks in order for the linear-search nodes.I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want toclutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemesdon't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration,or at least make it optional as in the patch. 
I think it's still important for lazy vacuum that an iteration over a TID store returns TIDs in ascending order, because otherwise a heap vacuum does random writes. That being said, we can have RT_ITERATE_NEXT() return key-value pairs in an order regardless of how the key chunks are stored in a node.
> ========================================
> psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
> (min load time of three)
>
> v15:
> rt_load_ms | rt_search_ms
> ------------+--------------
> 113 | 455
>
> v25-0005:
> rt_load_ms | rt_search_ms
> ------------+--------------
> 135 | 456
>
> v25-0006 (inlining or not):
> rt_load_ms | rt_search_ms
> ------------+--------------
> 136 | 455
>
> v25-0007 (remove dead code):
> rt_load_ms | rt_search_ms
> ------------+--------------
> 135 | 455
>
> v25-addendum...txt (no ordering):
> rt_load_ms | rt_search_ms
> ------------+--------------
> 134 | 455
>
> Note: The regression seems to have started in v17, which is the first with a full template.
>
> Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?
Yeah, that's more realistic. TidStore now encodes TIDs slightly differently from the benchmark test. I've attached the patch that adds a simple benchmark test using TidStore. With this test, I got similar trends of results to yours with gcc, but I've not analyzed them in depth yet.
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816
v25-0007 (remove dead code):
load_ms
---------
839
v25-addendum...txt (no ordering):
load_ms
---------
820
BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.
>
> I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
Yeah, these two changes make sense to me too.
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think it's still important for lazy vacuum that an iteration over a
> TID store returns TIDs in ascending order, because otherwise a heap
> vacuum does random writes. That being said, we can have
> RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
> the key chunks are stored in a node.
Okay, we can keep that possibility in mind if we need to go there.
> > Note: The regression seems to have started in v17, which is the first with a full template.
> > 0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.
It just occurred to me that these facts might be related. v17 was the first use of the full template, and I decided then I liked one of your earlier patches where replace_node() calls node_update_inner() better than calling node_insert_inner() with a NULL parent, which was a bit hard to understand. That now-dead code was actually used in the latter case for updating the (original) parent. It's possible that trying to use separate paths contributed to the regression. I'll try the other way and report back.
> I've attached the patch that adds a simple benchmark test using
> TidStore. With this test, I got similar trends of results to yours
> with gcc, but I've not analyzed them in depth yet.
Thanks for that! I'll take a look.
> BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.
Absolutely.
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
>
> v15:
> load_ms
> ---------
> 816
How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with vacuum now, so I reset all that, but I'm still getting build errors because the tid store types and functions have changed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 10, 2023 at 3:51 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > query: select * from bench_tidstore_load(0, 10 * 1000 * 1000) > > > > v15: > > load_ms > > --------- > > 816 > > How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch, whichconflicts with vacuum now, so reset all that, but still getting build errors because the tid store types and functionshave changed. I applied v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch on top of v15 radix tree and changed the TidStore so that it uses v15 (non-templated) radixtree. That way, we can test TidStore using v15 radix tree. I've attached the patch that I applied on top of v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
I didn't get any closer to radix-tree regression, but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashion in the attached .txt addendums that I can clean up and incorporate later.
To start, I can reproduce the regression with this test as well:
select * from bench_tidstore_load(0, 10 * 1000 * 1000);
v15 + v26 store + adjustments:
mem_allocated | load_ms
---------------+---------
98202152 | 1676
v26 0001-0008
mem_allocated | load_ms
---------------+---------
98202032 | 1826
...and reverting to the alternate way to update the parent didn't help:
v26 0001-6, 0008, insert_inner w/ null parent
mem_allocated | load_ms
---------------+---------
98202032 | 1825
...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):
v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf
v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf
v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.) Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way. These together led to a nice speedup on the v26 branch:
mem_allocated | load_ms
---------------+---------
98202032 | 1386
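To illustrate the shape of the v2699-0001 change described above, here is a sketch (not the actual patch) of filling per-key bitmap words for one block without zeroing the whole values[] array up front; the names, the 64-bit value width, and the 6-bit split are assumptions based on the discussion in this thread.

/* Sketch only: names and the 6-bit offset split are assumed here. */
static void
collect_block_offsets(BlockNumber blkno, const OffsetNumber *offsets,
                      int num_offsets, uint64 *values)
{
    int         highest_idx = -1;

    for (int i = 0; i < num_offsets; i++)
    {
        OffsetNumber off = offsets[i];
        int         idx = off >> 6;    /* which 64-bit bitmap word */

        /* offsets are assumed to arrive in ascending order */
        Assert(i == 0 || off > offsets[i - 1]);

        /* zero only the words we actually reach, in order */
        while (highest_idx < idx)
            values[++highest_idx] = 0;

        values[idx] |= UINT64CONST(1) << (off & 63);
    }

    /* only keys up to highest_idx need to be inserted into the radix tree */
}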
v2699-0002: The next thing I noticed is forming a full ItemPointer to pass to tid_to_key_off(). That's bad for tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:
static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}
Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
}
There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have. Then I turned tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid().
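As a sketch of what that split might look like: the two function names appear above, but the exact bit layout, the constants, and the out-parameter are assumptions for illustration (64-bit bitmap values and 8kB pages assumed).

#define LOWER_OFFSET_NBITS  6       /* 64 bits per bitmap value (assumed) */
#define LOWER_OFFSET_MASK   ((1 << LOWER_OFFSET_NBITS) - 1)
#define UPPER_OFFSET_NBITS  3       /* enough for MaxHeapTuplesPerPage = 291 at 8kB */

/* build the key straight from a block number we already have */
static inline uint64
encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off_bit)
{
    *off_bit = offset & LOWER_OFFSET_MASK;
    return ((uint64) block << UPPER_OFFSET_NBITS) | (offset >> LOWER_OFFSET_NBITS);
}

/* full conversion, still needed for tidstore_lookup_tid() */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off_bit)
{
    return encode_key_off(ItemPointerGetBlockNumber(tid),
                          ItemPointerGetOffsetNumber(tid),
                          off_bit);
}

The point is that only the TID-based wrapper ever touches BlockIdGetBlockNumber(); the add path works from the BlockNumber it already has.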
v2699-0003: Get rid of all the remaining special cases for encoding or not. I am unaware of the need to optimize that case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measure this separately, but 0002+0003 gives:
mem_allocated | load_ms
---------------+---------
98202032 | 1259
If these are acceptable, I can incorporate them into a later patchset. In any case, speeding up tidstore_add_tids() will make any regressions in the backing radix tree more obvious. I will take a look at that next week.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Sat, Feb 11, 2023 at 2:33 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > I didn't get any closer to radix-tree regression, Me neither. It seems that in v26, inserting chunks into node-32 is slow but needs more analysis. I'll share if I found something interesting. > but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashionin the attached .txt addendums that I can clean up and incorporate later. > > To start, I can reproduce the regression with this test as well: > > select * from bench_tidstore_load(0, 10 * 1000 * 1000); > > v15 + v26 store + adjustments: > mem_allocated | load_ms > ---------------+--------- > 98202152 | 1676 > > v26 0001-0008 > mem_allocated | load_ms > ---------------+--------- > 98202032 | 1826 > > ...and reverting to the alternate way to update the parent didn't help: > > v26 0001-6, 0008, insert_inner w/ null parent > > mem_allocated | load_ms > ---------------+--------- > 98202032 | 1825 > > ...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case. > > Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declaredlocally in v26): > > v15 + v26 store + adjustments: > 65.88% postgres postgres [.] tidstore_add_tids > 10.74% postgres postgres [.] rt_set > 9.20% postgres postgres [.] palloc0 > 6.49% postgres postgres [.] rt_node_insert_leaf > > v26 0001-0008 > 78.50% postgres postgres [.] tidstore_add_tids > 8.88% postgres postgres [.] palloc0 > 6.24% postgres postgres [.] local_rt_node_insert_leaf > > v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compilerdoesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnosticabout what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoidabout that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert thatthe keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think ofa reason we would ever encounter offsets out of order.) I can think that something like traversing a HOT chain could visit offsets out of order. But fortunately we prune such collected TIDs before heap vacuum in heap case. > Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. Thatmight shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, whichI think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero shouldwork the same way. These together led to a nice speedup on the v26 branch: > > mem_allocated | load_ms > ---------------+--------- > 98202032 | 1386 > > v2699-0002: The next thing I noticed is forming a full ItemIdPointer to pass to tid_to_key_off(). 
That's bad for tidstore_add_tids()because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd: > > static inline void > BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber) > { > blockId->bi_hi = blockNumber >> 16; > blockId->bi_lo = blockNumber & 0xffff; > } > > Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process: > > static inline BlockNumber > BlockIdGetBlockNumber(const BlockIdData *blockId) > { > return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo); > } > > There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, Icreated a new function encode_key_off() [name could be better], which deals with the raw block number that we already have.Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid(). > > v2699-0003: Get rid of all the remaining special cases for encoding/or not. I am unaware of the need to optimize that caseor treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measurethis separately, but 0002+0003 gives: > > mem_allocated | load_ms > ---------------+--------- > 98202032 | 1259 > > If these are acceptable, I can incorporate them into a later patchset. These are nice improvements! I agree with all changes. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 11, 2023 at 2:33 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > I didn't get any closer to radix-tree regression,
>
> Me neither. It seems that in v26, inserting chunks into node-32 is
> slow but needs more analysis. I'll share if I found something
> interesting.
If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same or faster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce noise from growing size class, and despite the name it measures load time as well). Trying this now shows no difference: a few runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value pointer API didn't regress.
Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements):
v15 + v26 store:
mem_allocated | load_ms
---------------+---------
98202152 | 553
19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%
20.62% postgres postgres [.] rt_node_insert_leaf
6.05% postgres postgres [.] AllocSetAlloc
4.74% postgres postgres [.] AllocSetFree
4.62% postgres postgres [.] palloc
2.23% postgres postgres [.] SlabAlloc
v26:
mem_allocated | load_ms
---------------+---------
98202032 | 617
57.45% postgres postgres [.] tidstore_add_tids
20.67% postgres postgres [.] local_rt_node_insert_leaf
5.99% postgres postgres [.] AllocSetAlloc
3.55% postgres postgres [.] palloc
3.05% postgres postgres [.] AllocSetFree
2.05% postgres postgres [.] SlabAlloc
So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against v15.
I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called much more often, so I tried the following (done in 0007):
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
That brings it down to
mem_allocated | load_ms
---------------+---------
98202032 | 590
That's better, but still not within noise level. Perhaps some slowdown is unavoidable, but it would be nice to understand why.
> I can think that something like traversing a HOT chain could visit
> offsets out of order. But fortunately we prune such collected TIDs
> before heap vacuum in heap case.
Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
> > If these are acceptable, I can incorporate them into a later patchset.
>
> These are nice improvements! I agree with all changes.
Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification.
I squashed the earlier dead code removal into the radix tree patch.
v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the benchmarking module can always be built.
Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids() is doing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915
Fortunately, it's an easy fix, done in 0009.
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153
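For illustration, one way to keep the per-block extraction proportional to the number of set bits rather than to the full offset range is shown below. Whether 0009 does exactly this isn't shown in this thread; val, off_base, and iter_result are hypothetical names, and pg_rightmost_one_pos64() is the existing helper from pg_bitutils.h.

/* Sketch only: surrounding variables are hypothetical. */
uint64      val = *value_p;             /* 64-bit bitmap for this key */

while (val != 0)
{
    int         bitpos = pg_rightmost_one_pos64(val);

    iter_result->offsets[iter_result->num_offsets++] = off_base + bitpos;
    val &= val - 1;                     /* clear the lowest set bit */
}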
I'll soon resume more cosmetic review of the tid store, but this is enough to post.
Attachment
- v27-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v27-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v27-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch
- v27-0003-Add-radixtree-template.patch
- v27-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v27-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v27-0008-Measure-iteration-of-tidstore.patch
- v27-0007-Prevent-inlining-of-interface-functions-for-shme.patch
- v27-0009-Speed-up-tidstore_iter_extract_tids.patch
The benchmark module shouldn't have been un-commented-out, so I've attached a revert of that.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v28-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v28-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v28-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch
- v28-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v28-0003-Add-radixtree-template.patch
- v28-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v28-0007-Prevent-inlining-of-interface-functions-for-shme.patch
- v28-0009-Speed-up-tidstore_iter_extract_tids.patch
- v28-0008-Measure-iteration-of-tidstore.patch
- v28-0010-Revert-building-benchmark-module-for-CI.patch
On Tue, Feb 14, 2023 at 8:24 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Feb 11, 2023 at 2:33 PM John Naylor > > <john.naylor@enterprisedb.com> wrote: > > > > > > I didn't get any closer to radix-tree regression, > > > > Me neither. It seems that in v26, inserting chunks into node-32 is > > slow but needs more analysis. I'll share if I found something > > interesting. > > If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same orfaster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce noisefrom growing size class, and despite the name it measures load time as well). Trying this now shows no difference: afew runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value pointerAPI didn't regress. > > Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements): > > v15 + v26 store: > > mem_allocated | load_ms > ---------------+--------- > 98202152 | 553 > > 19.71% postgres postgres [.] tidstore_add_tids > + 31.47% postgres postgres [.] rt_set > = 51.18% > > 20.62% postgres postgres [.] rt_node_insert_leaf > 6.05% postgres postgres [.] AllocSetAlloc > 4.74% postgres postgres [.] AllocSetFree > 4.62% postgres postgres [.] palloc > 2.23% postgres postgres [.] SlabAlloc > > v26: > > mem_allocated | load_ms > ---------------+--------- > 98202032 | 617 > > 57.45% postgres postgres [.] tidstore_add_tids > > 20.67% postgres postgres [.] local_rt_node_insert_leaf > 5.99% postgres postgres [.] AllocSetAlloc > 3.55% postgres postgres [.] palloc > 3.05% postgres postgres [.] AllocSetFree > 2.05% postgres postgres [.] SlabAlloc > > So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against v15. > > I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called muchmore often, so I tried the following (done in 0007) > > #define RT_PREFIX shared_rt > #define RT_SHMEM > -#define RT_SCOPE static > +#define RT_SCOPE static pg_noinline > > That brings it down to > > mem_allocated | load_ms > ---------------+--------- > 98202032 | 590 The improvement makes sense to me. I've also done the same test (with changing TIDS_PER_BLOCK_FOR_LOAD to 1): w/o 0007 patch: mem_allocated | load_ms | iter_ms ---------------+---------+--------- 98202032 | 334 | 445 (1 row) w/ 0007 patch: mem_allocated | load_ms | iter_ms ---------------+---------+--------- 98202032 | 316 | 434 (1 row) On the other hand, with TIDS_PER_BLOCK_FOR_LOAD being 30, the load performance didn't improve: w/0 0007 patch: mem_allocated | load_ms | iter_ms ---------------+---------+--------- 98202032 | 601 | 608 (1 row) w/ 0007 patch: mem_allocated | load_ms | iter_ms ---------------+---------+--------- 98202032 | 610 | 606 (1 row) That being said, it might be within noise level, so I agree with 0007 patch. > Perhaps some slowdown is unavoidable, but it would be nice to understand why. True. > > > I can think that something like traversing a HOT chain could visit > > offsets out of order. But fortunately we prune such collected TIDs > > before heap vacuum in heap case. 
> > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continueassuming that (with an assert added since it's more public in this form). I'm not sure why such basic common senseevaded me a few versions ago... Right. TidStore is implemented not only for heap, so loading out-of-order TIDs might be important in the future. > > > If these are acceptable, I can incorporate them into a later patchset. > > > > These are nice improvements! I agree with all changes. > > Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification. > I've attached some small patches to improve the radix tree and tidstrore: We have the following WIP comment in test_radixtree: // WIP: compiles with warnings because rt_attach is defined but not used // #define RT_SHMEM How about unsetting RT_SCOPE to suppress warnings for unused rt_attach and friends? FYI I've briefly tested the TidStore with blocksize = 32kb, and it seems to work fine. > I squashed the earlier dead code removal into the radix tree patch. Thanks! > > v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the benchmarkingmodule can always be built. > > Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids() isdoing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time: > > mem_allocated | load_ms | iter_ms > ---------------+---------+--------- > 98202032 | 589 | 915 > > Fortunately, it's an easy fix, done in 0009. > > mem_allocated | load_ms | iter_ms > ---------------+---------+--------- > 98202032 | 589 | 153 Cool! > > I'll soon resume more cosmetic review of the tid store, but this is enough to post. Thanks! You removed the vacuum integration patch from v27, is there any reason for that? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?
Sounds good to me, and the other fixes make sense as well.
> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.
That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
> You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
Do we need to do anything for this todo?
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed. Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.
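A sketch of what the dual-check callback could look like (none of this is in a posted patch; tidstore_lookup_tid() is the name used earlier in this thread, while the struct and the bsearch-style array helper are hypothetical):

/* Sketch of the dual-check idea; names marked below are hypothetical. */
static bool
lazy_tid_reaped_checked(ItemPointer itemptr, void *state)
{
    LVDeadItemsCheck *check = (LVDeadItemsCheck *) state;   /* hypothetical */
    bool        in_array = dead_items_array_contains(check->array, itemptr); /* hypothetical */
    bool        in_store = tidstore_lookup_tid(check->store, itemptr);

    if (in_array != in_store)
        elog(WARNING, "dead item mismatch for (%u,%u): array=%d store=%d",
             ItemPointerGetBlockNumber(itemptr),
             ItemPointerGetOffsetNumber(itemptr),
             in_array, in_store);

    /* trust the existing array while the new store is being validated */
    return in_array;
}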
We also want to verify that progress reporting works as designed and has no weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?
- * the memory space for storing dead items allocated in the DSM segment. We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of
The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.
I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?
Sounds good to me, and the other fixes make sense as well.
> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.
That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
> You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
Do we need to do anything for this todo?
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed. Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.
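A rough sketch of what that cross-check might look like in the index-vacuum callback (not from the patch; the struct carrying both representations and its field names are hypothetical, while TidStoreIsMember() and vac_cmp_itemptr() are the membership function and comparator referred to elsewhere in this thread):

/* hypothetical state holding both the new store and the old sorted array */
typedef struct LVDeadItemsCheck
{
	TidStore   *store;
	ItemPointerData *items;
	int			num_items;
} LVDeadItemsCheck;

static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
	LVDeadItemsCheck *dead = (LVDeadItemsCheck *) state;
	bool		in_store = TidStoreIsMember(dead->store, itemptr);
	bool		in_array = bsearch(itemptr, dead->items, dead->num_items,
								   sizeof(ItemPointerData),
								   vac_cmp_itemptr) != NULL;

	/* warn rather than error, so a live system can be investigated */
	if (in_store != in_array)
		elog(WARNING, "dead item mismatch at (%u,%u): store=%d array=%d",
			 ItemPointerGetBlockNumber(itemptr),
			 ItemPointerGetOffsetNumber(itemptr),
			 in_store, in_array);

	return in_store;
}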
We also want to verify that progress reporting works as designed and has no weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?
- * the memory space for storing dead items allocated in the DSM segment. We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of
The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.
I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.
Hi,

On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > Right. TidStore is implemented not only for heap, so loading
> > out-of-order TIDs might be important in the future.
>
> That's what I was probably thinking about some weeks ago, but I'm having a
> hard time imagining how it would come up, even for something like the
> conveyor-belt concept.

We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.

Greetings,

Andres Freund
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a
> > hard time imagining how it would come up, even for something like the
> > conveyor-belt concept.
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.
I took a brief look at that and agree we should sometime make it work there as well.
v26 tidstore_add_tids() appears to assume that it's only called once per block number. While the order of offsets doesn't matter there for a single block, calling it again with the same block would wipe out the earlier offsets, IIUC. To do an actual "add tid" where the order doesn't matter, it seems we would need to (acquire a lock if needed), read the current bitmap if one exists, OR in the new bit, then write it back out.
That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().
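For illustration, a sketch of what such a function might look like, under the same encoding assumptions as the earlier sketch; rt_set() and the TidStore field names stand in for the real radix tree API and are not the patch's actual names:

void
tidstore_set_block_offsets(TidStore *ts, BlockNumber block,
						   OffsetNumber *offsets, int num_offsets)
{
	offset_bitmap bitmap = 0;
	tidkey		cur_key = 0;
	bool		have_key = false;

	/* offsets are assumed sorted, so keys are produced in ascending order */
	for (int i = 0; i < num_offsets; i++)
	{
		ItemPointerData tid;
		offset_bitmap bit;
		tidkey		key;

		Assert(i == 0 || offsets[i] > offsets[i - 1]);

		ItemPointerSet(&tid, block, offsets[i]);
		key = encode_tid(&tid, &bit);

		/* flush the accumulated bitmap when we cross into a new key */
		if (have_key && key != cur_key)
		{
			rt_set(ts->tree, cur_key, bitmap);
			bitmap = 0;
		}
		cur_key = key;
		have_key = true;
		bitmap |= bit;
	}

	if (have_key)
		rt_set(ts->tree, cur_key, bitmap);
}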
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> > We have the following WIP comment in test_radixtree:
> >
> > // WIP: compiles with warnings because rt_attach is defined but not used
> > // #define RT_SHMEM
> >
> > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > and friends?
>
> Sounds good to me, and the other fixes make sense as well.

Thanks, I merged them.

> > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > seems to work fine.
>
> That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

According to the doc, the minimum block size is 1kB. It seems to work fine with 1kB blocks.

> +void
> +tidstore_destroy(TidStore *ts)
>
> Do we need to do anything for this todo?

Since it's practically no problem, I think we can live with it for now. dshash also has the same todo.

> Now for some general comments on the tid store...
> [...]
> I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.

The attached 0008 patch addressed all above comments on tidstore.

> I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.

Yeah, I did a similar thing in an earlier version of tidstore patch. Since we're trying to introduce two new components: radix tree and tidstore, I sometimes find it hard to investigate failures happening during lazy (parallel) vacuum due to a bug either in tidstore or radix tree. If there is a bug in lazy vacuum, we cannot even do initdb. So it might be a good idea to do such checks in USE_ASSERT_CHECKING (or with another macro say DEBUG_TIDSTORE) builds. For example, TidStore stores tids to both the radix tree and array, and checks if the results match when lookup or iteration. It will use more memory but it would not be a big problem in USE_ASSERT_CHECKING builds. It would also be great if we can enable such checks on some bf animals.

> Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.

Thanks!

> This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.

Agreed.

> I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?

I think we need to change the format to INT64_FORMAT.

> The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
>
> If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
> What does dsa_minimum_size() work out to in practice? 1MB?
> Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.

Right. The attached 0009 patch addressed comments on vacuum integration except for the correctness checking.

> RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.

It was used in radixtree_iter_impl.h. But I removed it as it was not necessary.

> These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.

I agree with all of the above comments. The attached 0007 patch addressed comments on the radix tree.

> In the test:
>
> + 4, /* RT_NODE_KIND_4 */
>
> The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.

Since this information is used for the number of keys inserted, it doesn't check the node kind. So we just didn't test node-3. It might be better to expose and use both RT_SIZE_CLASS and RT_SIZE_CLASS_INFO.

> I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. [...]

Thanks, that's also on my todo list. TBH I'm not sure we should improve the deletion at this stage as there is no use case of deletion in the core. I'd prefer to focus on improving the quality of the current radix tree and tidstore now, and I think we can support node-shrinking once we are confident with the current implementation.

On Fri, Feb 17, 2023 at 5:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().

tidstore_set_block_offsets() sounds better. I used TidStoreSetBlockOffsets() in the latest patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v29-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v29-0007-Review-radix-tree.patch
- v29-0010-Revert-building-benchmark-module-for-CI.patch
- v29-0009-Review-vacuum-integration.patch
- v29-0008-Review-TidStore.patch
- v29-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v29-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v29-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v29-0003-Add-radixtree-template.patch
- v29-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Yeah, I did a similar thing in an earlier version of tidstore patch.
> Since we're trying to introduce two new components: radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug either in tidstore or radix
> tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids to both the radix tree and array, and checks if the
> results match when lookup or iteration. It will use more memory but it
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.

I've tried this idea. Enabling this check on all debug builds (i.e., with USE_ASSERT_CHECKING macro) seems not a good idea so I use a special macro for that, TIDSTORE_DEBUG. I think we can define this macro on some bf animals (or possibly a new bf animal).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Yeah, I did a similar thing in an earlier version of tidstore patch.
Okay, if you had checks against the old array lookup in development, that gives us better confidence.
> > Since we're trying to introduce two new components: radix tree and
> > tidstore, I sometimes find it hard to investigate failures happening
> > during lazy (parallel) vacuum due to a bug either in tidstore or radix
> > tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> > it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> > with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> > stores tids to both the radix tree and array, and checks if the
> > results match when lookup or iteration. It will use more memory but it
> > would not be a big problem in USE_ASSERT_CHECKING builds. It would
> > also be great if we can enable such checks on some bf animals.
>
> I've tried this idea. Enabling this check on all debug builds (i.e.,
> with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
> special macro for that, TIDSTORE_DEBUG. I think we can define this
> macro on some bf animals (or possibly a new bf animal).
I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Feb 22, 2023 at 4:35 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.

I guess that it would also be helpful at least until the GA release. People will be able to test them easily on their workloads or their custom test scenarios.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 22, 2023 at 4:35 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
>
> I guess that It would also be helpful at least until the GA release.
> People will be able to test them easily on their workloads or their
> custom test scenarios.
That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
TPC-C was just an example. It should have testing comparing the old and new methods. If you have already done that to some degree, that might be enough. After performance tests, I'll also try some vacuums that use the comparison patch.
--
John Naylor
EDB: http://www.enterprisedb.com
I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.
=== test 1, delete everything from a small table, with very small maintenance_work_mem:
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
-- unrealistically low
alter system set maintenance_work_mem = '32MB';
create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,50*1000*1000);
create index on test (x);
delete from test;
vacuum (verbose, truncate off) test;
--
master:
INFO: finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s
v29 patch:
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s
This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.
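For what it's worth, the "index scans: 9" figure for master is consistent with back-of-envelope arithmetic on the settings above (my own calculation, ignoring the small array header):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double	dead_tuples = 50 * 1000 * 1000;	/* every row deleted */
	double	tid_bytes = 6;					/* sizeof(ItemPointerData) */
	double	limit = 32 * 1024 * 1024;		/* maintenance_work_mem = 32MB */

	/* ~300 MB of TIDs / ~32 MB per pass => 9 index scans */
	printf("index scans: %.0f\n", ceil(dead_tuples * tid_bytes / limit));
	return 0;
}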
=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary search's pre-check
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
alter system set maintenance_work_mem = '1GB';
create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,1000*1000*1000);
vacuum (freeze) test;
select pg_size_pretty(pg_table_size('test'));
pg_size_pretty
----------------
41 GB
create index on test (x);
select pg_size_pretty(pg_total_relation_size('test'));
pg_size_pretty
----------------
71 GB
select max(ctid) from test;
max
--------------
(5405405,75)
delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;
vacuum (verbose, truncate off) test;
both:
INFO: vacuuming "john.naylor.public.test"
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000 dead item identifiers removed
--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
v29 patch:
system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s
The entire vacuum took 25% less wall clock time. Reminder that this is without wal logging, and also unscientific because only one run.
--
I took 10 seconds of perf data while index vacuuming was going on (showing calls > 2%):
master:
40.59% postgres postgres [.] vac_cmp_itemptr
24.97% postgres libc-2.17.so [.] bsearch
6.67% postgres postgres [.] btvacuumpage
4.61% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
3.48% postgres postgres [.] PageIndexMultiDelete
2.67% postgres postgres [.] vac_tid_reaped
2.03% postgres postgres [.] compactify_tuples
2.01% postgres libc-2.17.so [.] __memcpy_ssse3_back
v29 patch:
29.22% postgres postgres [.] TidStoreIsMember
9.30% postgres postgres [.] btvacuumpage
7.76% postgres postgres [.] PageIndexMultiDelete
6.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
5.60% postgres postgres [.] compactify_tuples
4.26% postgres libc-2.17.so [.] __memcpy_ssse3_back
4.12% postgres postgres [.] hash_search_with_hash_value
--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, num_dead_tuples from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
vacuuming indexes | 5405406 | 5405406 | 178956969 | 38000000
v29 patch:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
vacuuming indexes | 5405406 | 5405406 | 1073670144 | 8678064
Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.
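As a sanity check on those numbers (my arithmetic, assuming 6 bytes per ItemPointerData and ignoring header overhead):

#include <stdio.h>

int
main(void)
{
	long long	max_dead_tuples = 178956969;	/* reported by the old view */
	long long	num_dead_tuples = 38000000;

	/* ~1,073 MB allocated up front vs. ~228 MB actually filled */
	printf("allocated: %lld bytes\n", max_dead_tuples * 6);
	printf("used:      %lld bytes\n", num_dead_tuples * 6);
	return 0;
}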
There are other cases that could be tested (I mentioned some above), but this is enough to show the improvements possible.
I still need to do some cosmetic follow-up to v29 as well as a status report, and I will try to get back to that soon.
On Wed, Feb 22, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.

True. Even if we've done enough testing we cannot claim there is no bug. My idea is to make the bug investigation easier but on reflection, it seems not the best idea given this purpose. Instead, it seems to be better to add more necessary assertions. What do you think about the attached patch? Please note that it also includes the changes for minimum memory requirement.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 23, 2023 at 6:41 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.

Thank you for the test!

> This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.

Cool.

> master:
> system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
>
> v29 patch:
> system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s

In v29 vacuum took twice as long (286 s vs. 573 s)?

> Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Feb 24, 2023 at 3:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> In v29 vacuum took twice as long (286 s vs. 573 s)?
Not sure what happened there, and clearly I was looking at the wrong number :/
I scripted the test for reproducibility and ran it three times. Also included some variations (attached):
UUID times look comparable here, so no speedup or regression:
master:
system usage: CPU: user: 216.05 s, system: 35.81 s, elapsed: 634.22 s
system usage: CPU: user: 173.71 s, system: 31.24 s, elapsed: 599.04 s
system usage: CPU: user: 171.16 s, system: 30.21 s, elapsed: 583.21 s
v29:
system usage: CPU: user: 93.47 s, system: 40.92 s, elapsed: 594.10 s
system usage: CPU: user: 99.58 s, system: 44.73 s, elapsed: 606.80 s
system usage: CPU: user: 96.29 s, system: 42.74 s, elapsed: 600.10 s
Then, I tried sequential integers, which is a much more favorable access pattern in general, and the new tid storage shows substantial improvement:
master:
system usage: CPU: user: 100.39 s, system: 7.79 s, elapsed: 121.57 s
system usage: CPU: user: 104.90 s, system: 8.81 s, elapsed: 124.24 s
system usage: CPU: user: 95.04 s, system: 7.55 s, elapsed: 116.44 s
v29:
system usage: CPU: user: 24.57 s, system: 8.53 s, elapsed: 61.07 s
system usage: CPU: user: 23.18 s, system: 8.25 s, elapsed: 58.99 s
system usage: CPU: user: 23.20 s, system: 8.98 s, elapsed: 66.86 s
That's fast enough that I thought an improvement would show up even with standard WAL logging (no separate attachment, since it's a trivial change). Seems a bit faster:
master:
system usage: CPU: user: 152.27 s, system: 11.76 s, elapsed: 216.86 s
system usage: CPU: user: 137.25 s, system: 11.07 s, elapsed: 213.62 s
system usage: CPU: user: 149.48 s, system: 12.15 s, elapsed: 220.96 s
v29:
system usage: CPU: user: 40.88 s, system: 15.99 s, elapsed: 170.98 s
system usage: CPU: user: 41.33 s, system: 15.45 s, elapsed: 166.75 s
system usage: CPU: user: 41.51 s, system: 18.20 s, elapsed: 203.94 s
There is more we could test here, but I feel better about these numbers.
In the next few days, I'll resume style review and list the remaining issues we need to address.
Attachment
On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> My idea is to make the bug investigation easier but on
> reflection, it seems not the best idea given this purpose.
My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> What do you think
> about the attached patch? Please note that it also includes the
> changes for minimum memory requirement.
Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */
I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
This change, however, defies common sense:
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is applied also to
+ * the local TidStore cases for simplicity.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem. It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
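If we do keep a check, one possible shape is to derive the floor at run time from dsa_minimum_size() rather than hard-coding 2MB; this is only a sketch of that idea, and the metadata allowance below is a placeholder, not a number from the patch:

#include "postgres.h"
#include "utils/dsa.h"

/* placeholder for whatever fixed metadata the store needs */
#define TIDSTORE_METADATA_BYTES	1024

static void
tidstore_check_memory_limit(size_t max_bytes)
{
	size_t		min_bytes = dsa_minimum_size() + TIDSTORE_METADATA_BYTES;

	if (max_bytes < min_bytes)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("memory limit must be at least %zu bytes, but %zu was given",
						min_bytes, max_bytes)));
}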
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Feb 28, 2023 at 3:42 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> - int max_off; /* the maximum offset number */
> + OffsetNumber max_off; /* the maximum offset number */
>
> I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.

Right. I'll separate this change as a separate patch.

> But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.

Right.

> It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).

IIUC both don't care about the allocated DSA segment size. Parallel hash accounts actual tuple (+ header) size as used memory but doesn't consider how much DSA segment is allocated behind. Both parallel hash and parallel bitmap scan can work even with work_mem = 64kB, but when checking the total DSA segment size allocated during these operations, it was 1MB.

I realized that there is a similar memory limit design issue also on the non-shared tidstore cases. We deduct 70kB from max_bytes but it won't work fine with work_mem = 64kB. Probably we need to reconsider it. FYI 70kB comes from the maximum slab block size for node256.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 28, 2023 at 3:42 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> > >
> > > My idea is to make the bug investigation easier but on reflection, it seems not the best idea given this purpose.
> >
> > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> >
> > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> >
> > > What do you think about the attached patch? Please note that it also includes the changes for minimum memory requirement.
> >
> > Most of the asserts look logical, or at least harmless.
> >
> > - int max_off; /* the maximum offset number */
> > + OffsetNumber max_off; /* the maximum offset number */
> >
> > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
>
> Right. I'll separate this change as a separate patch.
>
> > This change, however, defies common sense:
> >
> > +/*
> > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > + * the local TidStore cases for simplicity.
> > + */
> > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> >
> > + /* Sanity check for the max_bytes */
> > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > + TIDSTORE_MIN_MEMORY, max_bytes);
> >
> > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> >
> > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> >
> > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
>
> Right.
>
> > It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
>
> IIUC both don't care about the allocated DSA segment size. Parallel hash accounts actual tuple (+ header) size as used memory but doesn't consider how much DSA segment is allocated behind. Both parallel hash and parallel bitmap scan can work even with work_mem = 64kB, but when checking the total DSA segment size allocated during these operations, it was 1MB.
>
> I realized that there is a similar memory limit design issue also on the non-shared tidstore cases. We deduct 70kB from max_bytes but it won't work fine with work_mem = 64kB. Probably we need to reconsider it. FYI 70kB comes from the maximum slab block size for node256.

Currently, we calculate the slab block size enough to allocate 32 chunks from there. For node256, the leaf node is 2,088 bytes and the slab block size is 66,816 bytes. One idea to fix this issue is to decrease it. For example, with 16 chunks the slab block size is 33,408 bytes and with 8 chunks it's 16,704 bytes.

I ran a brief benchmark test with 70kB block size and 16kB block size:

* 70kB slab blocks:

select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     143085184 |    1216 |       750
(1 row)

* 16kB slab blocks:

select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     157601248 |    1220 |       786
(1 row)

There is a small performance difference, but a smaller slab block size seems to be acceptable if there is no better way.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
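For reference, the block sizing described above is a simple multiplication; a minimal sketch of the calculation (the macro and constant names here are illustrative, not necessarily what the patch uses):

/* Sketch: a slab block sized to hold a fixed number of chunks of one node class */
#define RT_SLAB_CHUNKS_PER_BLOCK  32    /* 16 or 8 would shrink the block */
#define RT_SLAB_BLOCK_SIZE(chunk_size) \
    ((chunk_size) * RT_SLAB_CHUNKS_PER_BLOCK)

/* For the node256 leaf class discussed above:
 *   2,088 bytes * 32 = 66,816 bytes  (the ~70kB deducted from max_bytes)
 *   2,088 bytes * 16 = 33,408 bytes
 *   2,088 bytes *  8 = 16,704 bytes
 */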
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > <john.naylor@enterprisedb.com> wrote:
> > > > >
> > > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> > >
> > > > My idea is to make the bug investigation easier but on
> > > > reflection, it seems not the best idea given this purpose.
> > >
> > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> > >
> > > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> > >
> > > > What do you think
> > > > about the attached patch? Please note that it also includes the
> > > > changes for minimum memory requirement.
> > >
> > > Most of the asserts look logical, or at least harmless.
> > >
> > > - int max_off; /* the maximum offset number */
> > > + OffsetNumber max_off; /* the maximum offset number */
> > >
> > > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
> >
> > Right. I'll separate this change as a separate patch.
> >
> > >
> > > This change, however, defies common sense:
> > >
> > > +/*
> > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > + * the local TidStore cases for simplicity.
> > > + */
> > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > >
> > > + /* Sanity check for the max_bytes */
> > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > + TIDSTORE_MIN_MEMORY, max_bytes);
> > >
> > > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> > >
> > > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> > >
> > > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
> >
> > Right.
> >
> > > It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
> >
> > IIUC both don't care about the allocated DSA segment size. Parallel
> > hash accounts actual tuple (+ header) size as used memory but doesn't
> > consider how much DSA segment is allocated behind. Both parallel hash
> > and parallel bitmap scan can work even with work_mem = 64kB, but when
> > checking the total DSA segment size allocated during these operations,
> > it was 1MB.
> >
> > I realized that there is a similar memory limit design issue also on
> > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > won't work fine with work_mem = 64kB. Probably we need to reconsider
> > it. FYI 70kB comes from the maximum slab block size for node256.
>
> Currently, we calculate the slab block size enough to allocate 32
> chunks from there. For node256, the leaf node is 2,088 bytes and the
> slab block size is 66,816 bytes. One idea to fix this issue is to
> decrease it.
I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault. If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.
If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change. I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
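A minimal sketch of that kind of check, assuming the existing MemoryContextMemAllocated() facility is what ends up being used (the helper and TidStore field names are made up for illustration, and this covers only the local-memory case):

/*
 * Hypothetical helper: report "over limit" only once block-level allocation
 * has actually exceeded max_bytes, instead of trying to predict block
 * growth in advance.
 */
static bool
TidStoreIsOverLimit(TidStore *ts)
{
    /* recurse into child contexts, e.g. the radix tree's slab contexts */
    return MemoryContextMemAllocated(ts->context, true) > ts->max_bytes;
}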
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Currently, we calculate the slab block size enough to allocate 32 chunks from there. For node256, the leaf node is 2,088 bytes and the slab block size is 66,816 bytes. One idea to fix this issue is to decrease it.
>
> I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.

Right. I guess we've discussed what we use for calculating the *used* memory amount but I don't remember.

I think I was confused by the fact that we use some different approaches to calculate the amount of used memory. Parallel hash and tidbitmap use the allocated chunk size whereas hash_agg_check_limits() in nodeAgg.c uses MemoryContextMemAllocated(), which uses the allocated block size.

> If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.
>
> If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change.

True.

> I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.

Yes, the progress reporting could be confusing. Particularly, in shared tidstore cases, the dead_tuple_bytes could be much bigger than max_dead_tuple_bytes. Probably what we need might be functions for MemoryContext and dsa_area to get the amount of memory that has been allocated, by not tracking every chunk space. For example, the functions would be like what SlabStats() does; iterate over every block and calculate the total/free memory usage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
>
> Right. I guess we've discussed what we use for calculating the *used*
> memory amount but I don't remember.
>
> I think I was confused by the fact that we use some different
> approaches to calculate the amount of used memory. Parallel hash and
> tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> allocated block size.
That's good to know. The latter says:
* After adding a new group to the hash table, check whether we need to enter
* spill mode. Allocations may happen without adding new groups (for instance,
* if the transition state size grows), so this check is imperfect.
I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
> > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
>
> Yes, the progress reporting could be confusing. Particularly, in
> shared tidstore cases, the dead_tuple_bytes could be much bigger than
> max_dead_tuple_bytes. Probably what we need might be functions for
> MemoryContext and dsa_area to get the amount of memory that has been
> allocated, by not tracking every chunk space. For example, the
> functions would be like what SlabStats() does; iterate over every
> block and calculates the total/free memory usage.
I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
/*
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
Then, after pruning the current page, store the tids and then get the block-level space again:
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
TidStoreSetBlockOffsets(...);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
TidStoreMemoryUsage(dead_items));
}
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
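A sketch of the reordered loop body (the names follow the v29 code quoted above; this is illustrative, not actual patch code, and the elided arguments are left elided):

/* top of the per-block loop: pause for an index vacuum cycle once we are over the limit */
if (TidStoreIsFull(vacrel->dead_items))
{
    /* ... perform a round of index and heap vacuuming ... */
}

/* ... prune the current page ... */

if (prunestate.num_offsets > 0)
{
    /* report progress using the memory consumed up to the *previous* page... */
    pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                 TidStoreMemoryUsage(vacrel->dead_items));

    /* ... and only then store this page's LP_DEAD items */
    TidStoreSetBlockOffsets(...);
}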
But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> Thoughts?

It seems to work, but it still doesn't work in a case where a shared tidstore is created with a 64kB memory limit, right? TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true from the beginning.

BTW I realized that since the caller can pass a dsa_area to the tidstore (and radix tree), if other data are allocated in the same DSA area, TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns a memory usage that includes not only itself but also that other data. Probably it's better to comment that the passed dsa_area should be dedicated to a tidstore (or a radix tree).

> But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
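To illustrate the dedicated-DSA caveat above: if the shared implementation reports usage as the size of the whole area, it could look roughly like the following (a sketch only; whether it really uses something like dsa_get_total_size(), and the type and field names, are assumptions, not the patch's actual code):

size_t
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
    /* counts everything allocated in the area, not just the tree */
    return dsa_get_total_size(tree->dsa);   /* 'dsa' field name assumed */
}

Anything else allocated in the same dsa_area would then be attributed to the tidstore, which is why the passed-in area needs to be dedicated to it.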
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> >
> > Thoughts?
>
> It seems to work, but it still doesn't work in a case where a shared
> tidstore is created with a 64kB memory limit, right?
> TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> from the beginning.
I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
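Idea (1) might look roughly like the following inside the radix tree template (a sketch; the symbol name matches the RT_MEASURE_MEMORY_USAGE option used later in the thread, while the struct and field names are illustrative):

/* at each node (or chunk) allocation inside the template */
#ifdef RT_MEASURE_MEMORY_USAGE
    tree->mem_used += alloc_size;
#endif

#ifdef RT_MEASURE_MEMORY_USAGE
uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
    return tree->mem_used;
}
#endif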
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 7, 2023 at 1:01 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

I prefer option (1) as it's straightforward. I mentioned a similar idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is defined. It might be worth checking if there is visible overhead of tracking chunk memory space. IIRC we've not evaluated it yet.

[1] https://www.postgresql.org/message-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA%40mail.gmail.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
>
> I prefer option (1) as it's straightforward. I mentioned a similar
> idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
> defined. It might be worth checking if there is visible overhead of
> tracking chunk memory space. IIRC we've not evaluated it yet.
Ok, let's try this -- I can test and profile later this week.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 8, 2023 at 1:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> > I prefer option (1) as it's straightforward. I mentioned a similar idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is defined. It might be worth checking if there is visible overhead of tracking chunk memory space. IIRC we've not evaluated it yet.
>
> Ok, let's try this -- I can test and profile later this week.

Thanks!

I've attached the new version patches. I merged the improvements and fixes I did in the v29 patch. 0007 through 0010 are updates from v29. The main change made in v30 is to make the memory measurement and RT_MEMORY_USAGE() optional, which is done in the 0007 patch. The 0008 and 0009 patches are the updates for the tidstore and the vacuum integration patches. Here are the results of quick tests (an average of 3 executions):

query: select * from bench_load_random_int(10 * 1000 * 1000)

* w/ RT_MEASURE_MEMORY_USAGE:
 mem_allocated | load_ms
---------------+---------
    1996512000 |    3305
(1 row)

* w/o RT_MEASURE_MEMORY_USAGE:
 mem_allocated | load_ms
---------------+---------
             0 |    3258
(1 row)

It seems to be within the noise level, but I agree with making it optional.

Apart from the memory measurement stuff, I've done another todo item on my list: adding min/max classes for node3 and node125. I've done that in the 0010 patch, and here is a quick test result:

query: select * from bench_load_random_int(10 * 1000 * 1000)

* w/ 0010 patch
 mem_allocated | load_ms
---------------+---------
    1268630080 |    3275
(1 row)

* w/o 0010 patch
 mem_allocated | load_ms
---------------+---------
    1996512000 |    3214
(1 row)

That's a good improvement in memory usage, without a noticeable performance overhead. FYI CLASS_3_MIN has a fanout of 1 and is 24 bytes in size, and CLASS_125_MIN has a fanout of 61 and is 768 bytes in size.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
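On the caller side, opting in to the optional accounting would then just be a matter of defining the symbol before including the template; a sketch (the header path and the surrounding template options are illustrative and omitted here):

/* in a template user, e.g. tidstore.c */
#define RT_MEASURE_MEMORY_USAGE   /* enables the counter and RT_MEMORY_USAGE() */
/* ... other RT_* template options ... */
#include "lib/radixtree.h"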
Attachment
- v30-0008-Remove-the-max-memory-deduction-from-TidStore.patch
- v30-0011-Revert-building-benchmark-module-for-CI.patch
- v30-0007-Radix-tree-optionally-tracks-memory-usage-when-R.patch
- v30-0009-Revert-the-update-for-the-minimum-value-of-maint.patch
- v30-0010-Add-min-and-max-classes-for-node3-and-node125.patch
- v30-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v30-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v30-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v30-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v30-0003-Add-a-macro-templatized-radix-tree.patch
- v30-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've attached the new version patches. I merged improvements and fixes
> I did in the v29 patch.
I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
> Apart from the memory measurement stuff, I've done another todo item
> on my list; adding min max classes for node3 and node125. I've done
This didn't help us move us closer to something committable the first time you coded this without making sure it was a good idea. It's still not helping and arguably makes it worse. To be fair, I did speak positively about _considering_ additional size classes some months ago, but that has a very obvious maintenance cost, something we can least afford right now.
I'm frankly baffled you thought this was important enough to work on again, yet thought it was a waste of time to try to prove to ourselves that autovacuum in a realistic, non-deterministic workload gave the same answer as the current tid lookup. Even if we had gone that far, it doesn't seem like a good idea to add non-essential code to critical paths right now.
We're rapidly running out of time, and we're at the point in the cycle where it's impossible to get meaningful review from anyone not already intimately familiar with the patch series. I only want to see progress on addressing possible (especially architectural) objections from the community, because if they don't notice them now, they surely will after commit. I have my own list of possible objections as well as bikeshedding points, which I'll clean up and share next week. I plan to invite Andres to look at that list and give his impressions, because it's a lot quicker than reading the patches. Based on that, I'll hopefully be able to decide whether we have enough time to address any feedback and do remaining polishing in time for feature freeze.
I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 10, 2023 at 3:42 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > I've attached the new version patches. I merged improvements and fixes I did in the v29 patch.
>
> I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).

Okay, I'll separate them again.

> > Apart from the memory measurement stuff, I've done another todo item on my list; adding min max classes for node3 and node125.
>
> This didn't help us move us closer to something committable the first time you coded this without making sure it was a good idea. It's still not helping and arguably makes it worse. To be fair, I did speak positively about _considering_ additional size classes some months ago, but that has a very obvious maintenance cost, something we can least afford right now.
>
> I'm frankly baffled you thought this was important enough to work on again, yet thought it was a waste of time to try to prove to ourselves that autovacuum in a realistic, non-deterministic workload gave the same answer as the current tid lookup. Even if we had gone that far, it doesn't seem like a good idea to add non-essential code to critical paths right now.

I didn't think that proving tidstore and the current tid lookup return the same result was a waste of time. I've shared a patch to do that in tidstore before. I agreed not to add it to the tree but we can test that using this patch. Actually, I've done a test that ran a pgbench workload for a few days.

IIUC it's still important to consider whether to have node1 since it could be a good alternative to path compression. The prototype also implemented it. Of course we can leave it for future improvement. But considering this item together with the performance tests helps us to prove that our decoupling approach is promising.

> We're rapidly running out of time, and we're at the point in the cycle where it's impossible to get meaningful review from anyone not already intimately familiar with the patch series. I only want to see progress on addressing possible (especially architectural) objections from the community, because if they don't notice them now, they surely will after commit.

Right, we've been making many design decisions. Some of them are agreed just between you and me and some are agreed with other hackers. There are some irreversible design decisions due to the remaining time.

> I have my own list of possible objections as well as bikeshedding points, which I'll clean up and share next week.

Thanks.

> I plan to invite Andres to look at that list and give his impressions, because it's a lot quicker than reading the patches. Based on that, I'll hopefully be able to decide whether we have enough time to address any feedback and do remaining polishing in time for feature freeze.
>
> I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.

Apart from more rounds of reviews and tests, my todo items that need discussion and possibly implementation are:

* The memory measurement in radix trees and the memory limit in tidstores. I've implemented it in v30-0007 through 0009 but we need to review it. This is the highest priority for me.

* Additional size classes. It's important for an alternative of path compression as well as supporting our decoupling approach. Middle priority.

* Node shrinking support. Low priority.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 3:42 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
>
> Okay, I'll separate them again.

Attached is a new patch series. In addition to separating them again, I've fixed a conflict with HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v31-0013-Add-min-and-max-classes-for-node3-and-node125.patch
- v31-0011-Remove-the-max-memory-deduction-from-TidStore.patch
- v31-0009-Review-vacuum-integration.patch
- v31-0007-Review-radix-tree.patch
- v31-0014-Revert-building-benchmark-module-for-CI.patch
- v31-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v31-0008-Review-TidStore.patch
- v31-0003-Add-radixtree-template.patch
- v31-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v31-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v31-0012-Revert-the-update-for-the-minimum-value-of-maint.patch
- v31-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v31-0010-Radix-tree-optionally-tracks-memory-usage-when-R.patch
- v31-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
>
> Apart from more rounds of reviews and tests, my todo items that need
> discussion and possibly implementation are:
Quick thoughts on these:
> * The memory measurement in radix trees and the memory limit in
> tidstores. I've implemented it in v30-0007 through 0009 but we need to
> review it. This is the highest priority for me.
Agreed.
> * Additional size classes. It's important for an alternative of path
> compression as well as supporting our decoupling approach. Middle
> priority.
I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression. I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.
> * Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sun, Mar 12, 2023 at 12:54 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> > * Additional size classes. It's important for an alternative of path compression as well as supporting our decoupling approach. Middle priority.
>
> I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.

But does it mean that our node1 would help reduce the memory further, since our base node type (i.e. RT_NODE) is smaller than the base node type of Andres's prototype? The result I shared before showed 1.2GB vs. 1.9GB.

> I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.

I've evaluated the performance of node1 but the result seems to show the opposite. I used the test query:

select * from bench_search_random_nodes(100 * 1000 * 1000, '0xFF000000000000FF');

which makes the radix tree that has node1 look like:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257

All internal nodes except for the root node are node1. The radix tree that doesn't have node1 is:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257

Here is the result:

* w/ node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        573448 |    1848 |      1707
(1 row)

* w/o node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        598024 |    2014 |      1825
(1 row)

Am I missing something?

> About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.

Makes sense to me.

> > * Node shrinking support. Low priority.
>
> This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
>
> I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.

I think that the deletion (and locking support) doesn't have use cases in the core (i.e. tidstore) but is implemented so that external extensions can use it. There might not be such extensions.

Given the lack of use cases in the core (and the remaining time), I think it's okay even if the implementation of such an API is minimal and not optimized enough. For instance, the implementation of dshash.c is minimalist and doesn't have resizing. We can improve them in the future if extensions or other core features want them. Personally, I think we should focus on addressing the feedback we get and improving the existing use cases for the remaining time. That's why considering the min-max size classes has a higher priority than the node shrinking support in my todo list.

FYI, I've run a TPC-C workload over the weekend, and didn't get any failures of the assertion proving that tidstore and the current tid lookup return the same result.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 12, 2023 at 12:54 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > * Additional size classes. It's important for an alternative of path
> > > compression as well as supporting our decoupling approach. Middle
> > > priority.
> >
> > I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
>
> But does it mean that our node1 would help reduce the memory further
> since our base node type (i.e. RT_NODE) is smaller than the base
> node type of Andres's prototype? The result I shared before showed
> 1.2GB vs. 1.9GB.
The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.
In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.
> > I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
>
> I've evaluated the performance of node1 but the result seems to show
> the opposite.
As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only.
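To illustrate the aside (a sketch only, not code from either patch): a node type dedicated to one member can answer membership with a single compare, while a node1 that reuses the general node3 search loop still pays for the loop.

#include <stdbool.h>
#include <stdint.h>

/* A hypothetical node1: membership is a single comparison. */
static inline bool
node1_contains(uint8_t stored_chunk, uint8_t chunk)
{
	return stored_chunk == chunk;
}

/* The node3-style search loop: a linear scan, even when count == 1. */
static inline int
node3_find(const uint8_t *chunks, int count, uint8_t chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			return i;
	}
	return -1;
}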
> > > * Node shrinking support. Low priority.
> >
> > This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
> >
> > I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
>
> I think that the deletion (and locking support) doesn't have use cases
> in the core (i.e. tidstore) but is implemented so that external
> extensions can use it.
I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane. I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.
Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.
> FYI, I've run TPC-C workload over the weekend, and didn't get any
> failures of the assertion proving tidstore and the current tid lookup
> return the same result.
Great!
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 13, 2023 at 10:28 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Mar 12, 2023 at 12:54 AM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > * Additional size classes. It's important for an alternative of path
> > > > compression as well as supporting our decoupling approach. Middle
> > > > priority.
> > >
> > > I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
> >
> > But does it mean that our node1 would help reduce the memory further,
> > since our base node type (i.e. RT_NODE) is smaller than the base
> > node type of Andres's prototype? The result I shared before showed
> > 1.2GB vs. 1.9GB.
>
> The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.
>
> In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.

I agree that memory accounting/limiting stuff is the highest priority. So what kinds of size classes do you think we need? node3, 15, 32, 61, and 256?

> > > I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
> >
> > I've evaluated the performance of node1 but the result seems to show
> > the opposite.
>
> As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only.

Agreed.

> > > > * Node shrinking support. Low priority.
> > >
> > > This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
> > >
> > > I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
> >
> > I think that the deletion (and locking support) doesn't have use cases
> > in the core (i.e. tidstore) but is implemented so that external
> > extensions can use it.
>
> I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane.

Right.

> I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.
>
> Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.

Okay. Given that adding shrinking support also requires maintenance costs (and probably new test cases?) and there are no use cases in the core, I'm not sure it's worth supporting it at this stage. So I prefer either shipping the deletion API as it is or removing the deletion API. I think that's a discussion point where we'd like to hear feedback from other hackers.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I wrote:
> > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > Thoughts?
> >
> > It looks to work but it still doesn't work in a case where a shared
> > tidstore is created with a 64kB memory limit, right?
> > TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
> > from the beginning.
>
> I have two ideas:
>
> 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).
In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment. I'm not sure about the DSA case. This doesn't seem great.
It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.
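For reference, a rough sketch of the accounting options being weighed here; GetMemoryChunkSpace() and MemoryContextMemAllocated() are existing backend APIs, while the wrapper and counters below are made up purely for illustration:

#include "postgres.h"
#include "utils/memutils.h"

static Size total_requested = 0;	/* raw request sizes: cheap, but underestimates */
static Size total_chunkspace = 0;	/* adds chunk header and alignment: closer, but slower */

static void *
tracked_alloc(MemoryContext cxt, Size size)
{
	void	   *p = MemoryContextAlloc(cxt, size);

	total_requested += size;					/* simple increment */
	total_chunkspace += GetMemoryChunkSpace(p);	/* indirect function call per chunk */

	return p;
}

/*
 * Block-level view: everything the context has malloc'd, including free
 * space inside blocks, so it tends to overestimate what has been touched.
 */
static Size
context_block_usage(MemoryContext cxt)
{
	return MemoryContextMemAllocated(cxt, true);
}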
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
I'll put this item and a couple other things together in a separate email tomorrow.
[1] https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
On Tue, Mar 14, 2023 at 8:27 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I wrote:
>
> > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > >
> > > > Thoughts?
> > >
> > > It looks to work but it still doesn't work in a case where a shared tidstore is created with a 64kB memory limit, right? TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true from the beginning.
> >
> > I have two ideas:
> >
> > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
> > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
>
> Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).
>
> In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment. I'm not sure about the DSA case. This doesn't seem great.

Right.

> It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
>
> Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.

There are precedents that don't provide a way to return memory usage, such as simplehash.h and dshash.c.

> I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.

What do you mean by "the precise usage" in your idea? Quoting from the email you referred to, Andres said:

---
One thing I was wondering about is trying to choose node types in roughly-power-of-two struct sizes. It's pretty easy to end up with significant fragmentation in the slabs right now when inserting as you go, because some of the smaller node types will be freed but not enough to actually free blocks of memory. If we instead have ~power-of-two sizes we could just use a single slab of the max size, and carve out the smaller node types out of that largest allocation.

Btw, that fragmentation is another reason why I think it's better to track memory usage via memory contexts, rather than doing so based on GetMemoryChunkSpace().
---

IIUC he suggested measuring memory usage at the block level in order to count blocks that are not actually freed even though some of their chunks are freed. That's why we used MemoryContextMemAllocated(). On the other hand, you recently pointed out[1]:

---
I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
---

IIUC you suggested measuring memory usage by tracking how much memory chunks are allocated within a block. If your idea at the top of the page follows this method, it still doesn't deal with the point Andres mentioned.

> I'll put this item and a couple other things together in a separate email tomorrow.

Thanks!

Regards,

[1] https://www.postgresql.org/message-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y%3Dq6ow%3DLTzyDvA%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > I wrote:
> >
> > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> IIUC you suggested measuring memory usage by tracking how much memory
> chunks are allocated within a block. If your idea at the top of the
> page follows this method, it still doesn't deal with the point Andres
> mentioned.
Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
766 + 2*(128) + 64 = 1086MB -> stop
That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).
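A rough sketch of how such a stairstep policy might look; this is purely illustrative and not proposed dsa.c code (real DSA also creates segments in pairs of each size, which this simplification ignores):

#include <stddef.h>

/* Pick the size of the next DSA segment under a soft limit. */
static size_t
next_segment_size(size_t prev_size, size_t total_size, size_t soft_limit)
{
	if (total_size < soft_limit)
		return prev_size * 2;			/* normal geometric growth */

	/* past the soft limit: step segment sizes back down instead */
	if (prev_size / 2 >= 1024 * 1024)
		return prev_size / 2;

	return 1024 * 1024;					/* arbitrary 1MB floor */
}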
And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 17, 2023 at 4:03 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > I wrote:
> > >
> > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> >
> > IIUC you suggested measuring memory usage by tracking how much memory chunks are allocated within a block. If your idea at the top of the page follows this method, it still doesn't deal with the point Andres mentioned.
>
> Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.

Right. I still like your re-ordering idea. It's true that most of the area of the last allocated block before heap scanning stops is not actually used yet. I'm guessing we can just check whether the context memory has gone over the limit. But I'm concerned it might not work well on systems where memory overcommit is disabled.

> However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
>
> I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
>
> m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
>
> 766 + 2*(128) + 64 = 1086MB -> stop
>
> That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).

This is an interesting idea. But I'm concerned that we don't have enough time to get confident with adding this new concept to DSA.

> And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.

Another problem we need to deal with is the supported minimum memory in shared tidstore cases. Since the initial DSA segment size is 1MB, memory usage of a shared tidstore will start from 1MB+. This is higher than the minimum values of both work_mem and maintenance_work_mem, 64kB and 1MB respectively.

Increasing the minimum m_w_m to 2MB seems to be acceptable in the community, but not for work_mem. One idea is to reject memory limits of less than 2MB, so it won't work with small m_w_m settings. That might be an acceptable restriction at this stage (where there is no use case for using tidstore with work_mem in the core), but it will be a blocker for future adoptions such as unifying with tidbitmap.c. Another idea is that the process can specify the initial segment size at dsa_create() so that DSA can start with a smaller segment, say 32kB. That way, a tidstore with a 32kB limit gets full once it allocates the next DSA segment, 32kB. A downside of this idea is that it increases the number of segments behind DSA. Assuming it's a relatively rare case where we use such a low work_mem, it might be acceptable.

FYI, the total number of DSM segments available on the system is calculated by:

#define PG_DYNSHMEM_FIXED_SLOTS 64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND 5

maxitems = PG_DYNSHMEM_FIXED_SLOTS + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > I wrote:
> > > >
> > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > >
> > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > >
> > > IIUC you suggested measuring memory usage by tracking how much memory chunks are allocated within a block. If your idea at the top of the page follows this method, it still doesn't deal with the point Andres mentioned.
> >
> > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
>
> Right. I still like your re-ordering idea. It's true that most of the area of the last allocated block before heap scanning stops is not actually used yet. I'm guessing we can just check whether the context memory has gone over the limit. But I'm concerned it might not work well on systems where memory overcommit is disabled.
>
> > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.

aset.c also has a similar characteristic; it allocates an 8K block upon the first allocation in a context, and doubles that size for each successive block request. But we can specify the initial block size and max block size. This made me think of another idea: specify both to DSA, with both values calculated based on m_w_m. For example, we can create a DSA in parallel_vacuum_init() as follows:

initial block size = min(m_w_m / 4, 1MB)
max block size = max(m_w_m / 8, 8MB)

In most cases, we can start with a 1MB initial segment, the same as before. For small memory cases, say m_w_m = 1MB, we start with a 256kB initial segment, and heap scanning stops after DSA has allocated 1.5MB (= 256kB + 256kB + 512kB + 512kB). For larger memory, we can have the heap scan stop after DSA allocates 1.25 times more memory than m_w_m. For example, if m_w_m = 1GB, the initial and maximum segment sizes are 1MB and 128MB respectively, and DSA allocates the segments as follows until heap scanning stops:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + (128 * 5) = 1150MB

dsa_allocate() will be extended to have the initial and maximum block sizes, like AllocSetContextCreate().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
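A sketch of that sizing rule, for concreteness; the helper below and the dsa_create_ext()-style call taking segment-size bounds are hypothetical names (today's dsa_create() takes no size arguments):

#include "postgres.h"

/* Derive DSA segment-size bounds from the memory limit (hypothetical helper). */
static void
choose_dsa_segment_sizes(Size limit_bytes, Size *init_size, Size *max_size)
{
	*init_size = Min(limit_bytes / 4, (Size) 1024 * 1024);		/* capped at 1MB */
	*max_size = Max(limit_bytes / 8, (Size) 8 * 1024 * 1024);	/* floored at 8MB */
}

/*
 * A creation call analogous to AllocSetContextCreate()'s initBlockSize and
 * maxBlockSize could then look like (dsa_create_ext() is hypothetical):
 *
 *		area = dsa_create_ext(tranche_id, init_size, max_size);
 */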
On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > <john.naylor@enterprisedb.com> wrote:
> > > > >
> > > > > I wrote:
> > > > >
> > > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > >
> > > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > >
> > > > IIUC you suggested measuring memory usage by tracking how much memory
> > > > chunks are allocated within a block. If your idea at the top of the
> > > > page follows this method, it still doesn't deal with the point Andres
> > > > mentioned.
> > >
> > > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
> >
> > Right. I still like your re-ordering idea. It's true that the most
> > area of the last allocated block before heap scanning stops is not
> > actually used yet. I'm guessing we can just check if the context
> > memory has gone over the limit. But I'm concerned it might not work
> > well in systems where overcommit memory is disabled.
> >
> > >
> > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
>
> aset.c also has a similar characteristic; allocates an 8K block upon
> the first allocation in a context, and doubles that size for each
> successive block request. But we can specify the initial block size
> and max blocksize. This made me think of another idea to specify both
> to DSA and both values are calculated based on m_w_m. For example, we
That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 20, 2023 at 9:34 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > > <john.naylor@enterprisedb.com> wrote:
> > > > > >
> > > > > > I wrote:
> > > > > >
> > > > > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
> > > > > >
> > > > > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
> > > > >
> > > > > IIUC you suggested measuring memory usage by tracking how much memory chunks are allocated within a block. If your idea at the top of the page follows this method, it still doesn't deal with the point Andres mentioned.
> > > >
> > > > Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
> > >
> > > Right. I still like your re-ordering idea. It's true that most of the area of the last allocated block before heap scanning stops is not actually used yet. I'm guessing we can just check whether the context memory has gone over the limit. But I'm concerned it might not work well on systems where memory overcommit is disabled.
> > >
> > > > However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
> >
> > aset.c also has a similar characteristic; it allocates an 8K block upon the first allocation in a context, and doubles that size for each successive block request. But we can specify the initial block size and max block size. This made me think of another idea: specify both to DSA, with both values calculated based on m_w_m. For example, we
>
> That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.

I've attached a quick hack patch. It can be applied on top of the v32 patches. The changes to dsa.c are straightforward since it makes the initial and max block sizes configurable. The patch includes a test function, test_memory_usage(), to simulate how DSA segments grow behind the shared radix tree.

If we set the first argument to true, it calculates both the initial and maximum block sizes based on work_mem (I used work_mem here just because its value range is larger than m_w_m's):

postgres(1:833654)=# select test_memory_usage(true);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 16777216
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 16777216
NOTICE: allocate new DSM [11] 16777216
NOTICE: allocate new DSM [12] 16777216
NOTICE: allocate new DSM [13] 16777216
NOTICE: allocate new DSM [14] 16777216
NOTICE: reached: 148897792 (+14680064)
NOTICE: 12718205 keys inserted: 148897792
 test_memory_usage
-------------------

(1 row)

Time: 7195.664 ms (00:07.196)

Setting the first argument to false, we can specify both manually in the second and third arguments:

postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024 * 1024 * 1024 * 10::bigint);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 10737418240
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 33554432
NOTICE: allocate new DSM [11] 33554432
NOTICE: allocate new DSM [12] 67108864
NOTICE: reached: 199229440 (+65011712)
NOTICE: 12718205 keys inserted: 199229440
 test_memory_usage
-------------------

(1 row)

Time: 7187.571 ms (00:07.188)

It seems to work fine. The difference between the above two cases is the maximum block size (16MB vs. 10GB). We allocated two more DSA segments in the first case, but there was no big difference in performance in my test environment.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Mar 20, 2023 at 9:34 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
>
> I've attached a quick hack patch. It can be applied on top of v32
> patches. The changes to dsa.c are straightforward since it makes the
> initial and max block sizes configurable.
Good to hear -- this should probably be proposed in a separate thread for wider visibility.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 21, 2023 at 2:41 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Mar 20, 2023 at 9:34 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
> >
> > I've attached a quick hack patch. It can be applied on top of v32
> > patches. The changes to dsa.c are straightforward since it makes the
> > initial and max block sizes configurable.
>
> Good to hear -- this should probably be proposed in a separate thread for wider visibility.

Agreed. I'll start a new thread for that.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.
I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:
- With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
- Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases. (A sketch of such a tagged slot follows this list.)
- Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
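A rough sketch of the combined pointer-value slot from the second point above; the names and exact tagging convention are assumptions for illustration, not code from this thread:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t rt_slot;

#define SLOT_POINTER_TAG	UINT64_C(1)

/* True if the slot holds a tagged pointer to a single-value leaf. */
static inline bool
slot_is_leaf_pointer(rt_slot slot)
{
	return (slot & SLOT_POINTER_TAG) != 0;
}

static inline void *
slot_get_leaf(rt_slot slot)
{
	return (void *) (uintptr_t) (slot & ~SLOT_POINTER_TAG);
}

/*
 * Offsets 1..63 are embedded directly as a bitmap in the upper 63 bits;
 * bit 0 stays clear, so an embedded slot can't be mistaken for a pointer.
 */
static inline rt_slot
slot_embed_offset(rt_slot slot, unsigned offset)
{
	return slot | (UINT64_C(1) << offset);
}

static inline bool
slot_test_offset(rt_slot slot, unsigned offset)
{
	return !slot_is_leaf_pointer(slot) &&
		(slot & (UINT64_C(1) << offset)) != 0;
}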
The above would address the points (not including better iteration and parallel bitmap index scans) raised in
https://www.postgresql.org/message-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far.
Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 7, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > We really ought to replace the tid bitmap used for bitmap heap scans. The
> > hashtable we use is a pretty awful data structure for it. And that's not
> > filled in-order, for example.
>
> I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:
>
> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.

Instead of introducing single-value leaves to the radix tree as another structure, can we store pointers to PagetableEntry as values?

> - Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.
>
> - Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
>
> The above would address the points (not including better iteration and parallel bitmap index scans) raised in
>
> https://www.postgresql.org/message-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
>
> Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far.
>
> Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.

Thanks. I'm going to continue researching the memory limitation and try lazy path expansion until PG17 development begins.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 11, 2023 at 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I've attached the new version patches. I merged improvements and fixes
> > > > I did in the v29 patch.
> > >
> > > I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
> >
> > Okay, I'll separate them again.
>
> Attached new patch series. In addition to separating them again, I've
> fixed a conflict with HEAD.

I've attached updated version patches to make cfbot happy. Also, I've split the fixup patches further (from 0007, except for 0016 and 0018) to make reviews easier. These patches have the prefixes radix tree, tidstore, and vacuum, indicating the part each one changes. The 0016 patch changes DSA so that we can specify both the initial and max segment sizes, and 0017 makes use of it in vacuumparallel.c.

I'm still researching a better solution for memory limitation, but this is the best solution for me for now.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
- v32-0015-vacuum-Miscellaneous-updates.patch
- v32-0016-Make-initial-and-maximum-DSA-segment-size-config.patch
- v32-0014-tidstore-Miscellaneous-updates.patch
- v32-0018-Revert-building-benchmark-module-for-CI.patch
- v32-0017-tidstore-vacuum-Specify-the-init-and-max-DSA-seg.patch
- v32-0010-radix-tree-fix-radix-tree-test-code.patch
- v32-0011-tidstore-vacuum-Use-camel-case-for-TidStore-APIs.patch
- v32-0012-tidstore-Use-concept-of-off_upper-and-off_lower.patch
- v32-0009-radix-tree-Review-tree-iteration-code.patch
- v32-0013-tidstore-Embed-output-offsets-in-TidStoreIterRes.patch
- v32-0008-radix-tree-remove-resolved-TODO.patch
- v32-0007-radix-tree-rename-RT_EXTEND-and-RT_SET_EXTEND-to.patch
- v32-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
- v32-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
- v32-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v32-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v32-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v32-0003-Add-radixtree-template.patch
On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> >
>
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?
Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.
The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)
> > Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
>
> Thanks. I'm going to continue researching the memory limitation and
Sounds like the best thing to nail down at this point.
> try lazy path expansion until PG17 development begins.
This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Apr 19, 2023 at 4:02 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> > >
> >
> > Instead of introducing single-value leaves to the radix tree as
> > another structure, can we store pointers to PagetableEntry as values?
>
> Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.
>
> The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)
>
> > > Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
> >
> > Thanks. I'm going to continue researching the memory limitation and
>
> Sounds like the best thing to nail down at this point.
>
> > try lazy path expansion until PG17 development begins.
>
> This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
>

I agree that we don't want to make the current patch more complex. Thinking about the memory limitation more, I think the combination of specifying the initial and maximum DSA segment size and using dsa_set_size_limit() works well. There are two goals for memory limiting: when memory usage reaches the limit, we want (1) to minimize the size of the last allocated memory block, which is allocated but not yet used, and (2) to minimize the amount of memory that exceeds the limit. Since we can specify the maximum DSA segment size, the last block allocated before reaching the memory limit is small. Also, thanks to dsa_set_size_limit(), the total DSA size will stop at the limit, so (memory_usage >= memory_limit) becomes true without any memory exceeding the limit.

Given that we need to configure the initial and maximum DSA segment size and set the DSA limit for TidStore memory accounting and limiting, it would be better to create the DSA for the TidStore inside the TidStoreCreate() API, rather than creating the DSA in the caller and passing it to TidStoreCreate().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
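[Editor's note: a minimal, hedged sketch of the idea above, not the actual patch. The dsa_create_ext()-style call with initial/maximum segment sizes stands in for what the 0016 patch proposes; the TidStoreSketch struct, the function names, and the segment-size heuristic are illustrative only.]

#include "postgres.h"
#include "utils/dsa.h"

/* Illustrative only; the real TidStore has more fields. */
typedef struct TidStoreSketch
{
	dsa_area   *area;
	size_t		max_bytes;
} TidStoreSketch;

static TidStoreSketch *
TidStoreCreateSketch(size_t max_bytes, int tranche_id)
{
	TidStoreSketch *ts = palloc0(sizeof(TidStoreSketch));
	size_t		init_seg = 1024 * 1024;					/* 1MB; illustrative */
	size_t		max_seg = Max(init_seg, max_bytes / 8); /* keep the last block small */

	ts->max_bytes = max_bytes;
	/* dsa_create_ext() stands in for the init/max segment size API of 0016 */
	ts->area = dsa_create_ext(tranche_id, init_seg, max_seg);
	/* total DSA size stops at the limit */
	dsa_set_size_limit(ts->area, max_bytes);
	return ts;
}

/* memory accounting check the caller (e.g. vacuum) could use */
static inline bool
TidStoreIsFullSketch(TidStoreSketch *ts)
{
	return dsa_get_total_size(ts->area) >= ts->max_bytes;
}

With this shape, the caller only passes the memory limit; the segment sizing and the DSA limit stay encapsulated in the TidStore.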
On Fri, Apr 7, 2023 at 4:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> - Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.
[just getting some thoughts out there before I have something concrete]
Thinking some more, this needn't be complicated at all. We'd just need to reserve some bits of a bitmapword for the tag, as well as flags for "ischunk" and "recheck". The other bits can be used for offsets. Getting/storing the offsets basically amounts to adjusting the shift by a constant. That way, this "embeddable PTE" could serve as both "PTE embedded in a node pointer" and also the first member of a full PTE. A full PTE is now just an array of embedded PTEs, except only the first one has the flags we need. That reduces the number of places that have to be different. Storing any set of offsets all less than ~60 would save allocation/traversal in a large number of real cases. Furthermore, that would reduce a full PTE to 40 bytes because there would be no padding.
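[Editor's note: a hedged sketch of the bit layout described above, assuming a 64-bit bitmapword; the field positions and names are illustrative, not from any patch. The three low bits hold the tag and the "ischunk"/"recheck" flags, and the remaining bits hold 1-based offsets, so storing or testing an offset is just a shift by a constant.]

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t bitmapword_t;	/* stand-in for bitmapword on 64-bit builds */

#define EPTE_FLAG_EMBEDDED	((bitmapword_t) 1 << 0)	/* tag: embedded vs. pointer to full PTE */
#define EPTE_FLAG_ISCHUNK	((bitmapword_t) 1 << 1)
#define EPTE_FLAG_RECHECK	((bitmapword_t) 1 << 2)
#define EPTE_OFFSET_SHIFT	3	/* offsets occupy the remaining bits */

/* offsets are 1-based; usable range here is 1..61 */
static inline bitmapword_t
epte_add_offset(bitmapword_t w, int off)
{
	return w | ((bitmapword_t) 1 << (off - 1 + EPTE_OFFSET_SHIFT));
}

static inline bool
epte_test_offset(bitmapword_t w, int off)
{
	return (w & ((bitmapword_t) 1 << (off - 1 + EPTE_OFFSET_SHIFT))) != 0;
}

Because a full PTE could begin with this same word, the embedded and full cases could share most code paths, as described above.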
This all assumes the key (block number) is no longer stored in the PTE, whether embedded or not. That would mean this technique:
> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
...is not a good trade off because it requires each leaf to have the key, and would thus reduce the utility of embedded leaves. We just need to make sure storing a single value is not costly, and I suspect it's not. (Currently the overhead avoided is allocating and zeroing a few kilobytes for a hash table). If it is not, then we don't need a special case in tidbitmap, which would be a great simplification. If it is, there are other ways to mitigate.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).
0001
This combines a few concepts that I didn't bother separating out after the fact:
- Split insert_impl.h into multiple functions for improved readability and maintainability.
- Use single-value leaves as the basis for storing values, with the goal to get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into recursive ones. This is a more natural way to traverse a tree structure, and we already see an advantage: Previously when growing a node, we searched within the parent to update its reference to the new node, because we didn't know the slot we descended from. Now we can simply update a single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to be stored in any node -- only the "top-level" start shift is stored in the tree control struct. This was easy to code since the node's shift value was hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.
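[Editor's note: a toy sketch of the recursive shape described above, kept self-contained by assuming only a 256-ary node kind; names are illustrative. The shift is consumed on the way down rather than stored in each node, and the function returns the address of the final slot, which the caller then fills in.]

#include <stdint.h>
#include <stdlib.h>

typedef struct toy_node
{
	void	   *slots[256];		/* child pointer at inner levels, value at the leaf level */
} toy_node;

/* returns the address of the slot for 'key'; caller stores the value there */
/* initial call would pass the tree's start shift, e.g. 56 for full 64-bit keys */
static void **
toy_set_recurse(toy_node *node, uint64_t key, int shift)
{
	uint8_t		chunk = (uint8_t) (key >> shift);

	if (shift == 0)
		return &node->slots[chunk];

	if (node->slots[chunk] == NULL)
		node->slots[chunk] = calloc(1, sizeof(toy_node));

	return toy_set_recurse((toy_node *) node->slots[chunk], key, shift - 8);
}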
0002
Back in v15, we tried keeping DSA/local pointers as members of a struct. I did not like the result, but still thought it was a good idea. RT_DELETE is a complex function and I didn't want to try rewriting it without a pointer abstraction, so I've resurrected this idea, but in a simpler, less intrusive way. A key difference from v15 is using a union type for the non-shmem case.
0004
Rewrite RT_DELETE using recursion. I find this simpler than the previous open-coded stack.
0005-06
Deletion has an inefficiency: One function searches for the child to see if it's there, then another function searches for it again to delete it. Since 0001, a successful child search returns the address of the slot, so we can save it. For the two smaller "linear search" node kinds we can then use a single subtraction to compute the chunk/slot index for deletion. Also, split RT_NODE_DELETE_INNER into separate functions, for a similar reason as the insert case in 0001.
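[Editor's note: for the linear-search node kinds, reusing the slot address from the earlier search is just pointer arithmetic. A hedged sketch with illustrative names, not the template's actual code:]

#include <stdint.h>
#include <string.h>

typedef struct toy_node4
{
	uint8_t		count;
	uint8_t		chunks[4];
	void	   *children[4];
} toy_node4;

/* 'slot' was returned by a previous child search within 'node' */
static void
toy_node4_delete(toy_node4 *node, void **slot)
{
	int			idx = (int) (slot - node->children);	/* a single subtraction */

	/* close the gap in both parallel arrays */
	memmove(&node->chunks[idx], &node->chunks[idx + 1],
			(node->count - idx - 1) * sizeof(uint8_t));
	memmove(&node->children[idx], &node->children[idx + 1],
			(node->count - idx - 1) * sizeof(void *));
	node->count--;
}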
0007
Anticipate node shrinking: If only one node-kind needs to be freed, we can move a branch to that one code path, rather than every place where RT_FREE is inlined.
0009
Teach node256 how to shrink *. Since we know the number of children in a node256 can't possibly be zero, we can use uint8 to store the count and interpret an overflow to zero as 256 for this node. The node header shrinks from 4 bytes to 3.
* Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.
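[Editor's note: a hedged sketch of the node256 count trick; names are illustrative. Since a node256 in use never has zero children, a uint8 counter whose wrap-around to zero is read back as 256 is enough.]

#include <stdint.h>

typedef struct toy_node256_header
{
	uint8_t		count;			/* 1..255 stored directly; 0 means 256 */
} toy_node256_header;

static inline int
toy_node256_count(const toy_node256_header *hdr)
{
	return hdr->count == 0 ? 256 : hdr->count;
}

static inline void
toy_node256_inc(toy_node256_header *hdr)
{
	hdr->count++;				/* incrementing 255 wraps to 0, i.e. 256 */
}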
0010
Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
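[Editor's note: a hedged sketch of the compile-time decision, assuming a 64-bit build; the names and the memcpy-based store are illustrative, not the template's actual mechanism. If the value type fits in a pointer-sized slot, it is stored in the last-level child slot directly; otherwise a single-value leaf is allocated.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t rt_value_type;	/* stand-in for the template's value type parameter */

enum
{
	RT_VALUES_EMBEDDABLE = sizeof(rt_value_type) <= sizeof(void *)
};

static void
rt_slot_store(void **slot, const rt_value_type *value)
{
	if (RT_VALUES_EMBEDDABLE)
	{
		*slot = NULL;							/* clear any unused bytes of the slot */
		memcpy(slot, value, sizeof(*value));	/* the value lives in the slot itself */
	}
	else
	{
		void	   *leaf = malloc(sizeof(*value));	/* single-value leaf */

		memcpy(leaf, value, sizeof(*value));
		*slot = leaf;
	}
}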
What I've shared here could work (in principle, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
Hi,

On Tue, May 23, 2023 at 7:17 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I wrote:
> > the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
>
> Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09).
>

Thank you for making progress on this. I agree with these directions overall. I have some comments and questions:

> - With the latter in mind, searching the child within a node now returns the address of the slot. This allows the same interface whether the slot contains a child pointer or a value.

Probably we can apply similar changes to the iteration as well.

> * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)

Within the size class, we just alloc a new node of the lower size class and do memcpy(). I guess it will be almost the same as what we do for growing. It might be a good idea to support node shrinking within the size class for node32 (and node125 if we support it). I don't think shrinking class-3 to class-1 makes sense.

> Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.

Yes, but it also means that we use a pointer-sized value anyway even if the value size is less than that, which wastes memory, no?

> What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).

Sounds good.

> I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout. This is for readability (by matching the language in the paper) and maintainability (should *not* ever change again). The size classes (including multiple classes per kind) could be determined by macros and #ifdef's. For example, in non-SIMD architectures, it's likely slow to search an array of 32 key chunks, so in that case the compiler should choose size classes similar to these four nominal kinds.

If we want to use the node kinds used in the paper, I think we should change the number in RT_NODE_KIND_X too. Otherwise, it would be confusing when reading the code without referring to the paper. Particularly, this part is very confusing:

    case RT_NODE_KIND_3:
        RT_ADD_CHILD_4(tree, ref, node, chunk, child);
        break;
    case RT_NODE_KIND_32:
        RT_ADD_CHILD_16(tree, ref, node, chunk, child);
        break;
    case RT_NODE_KIND_125:
        RT_ADD_CHILD_48(tree, ref, node, chunk, child);
        break;
    case RT_NODE_KIND_256:
        RT_ADD_CHILD_256(tree, ref, node, chunk, child);
        break;

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).
> >
>
> Thank you for making progress on this. I agree with these directions
> overall. I have some comments and questions:
Glad to hear it and thanks for looking!
> > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)
>
> Within the size class, we just alloc a new node of lower size class
> and do memcpy(). I guess it will be almost same as what we do for
> growing.
Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
...may be better with a #defined symbol that can also be used elsewhere.
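[Editor's note: a hedged sketch of that idea, using stand-in names (RT_FANOUT_3, the struct, and the table are illustrative, not the patch's identifiers): one symbol feeds the struct declaration, the class-info table, and potentially the tests.]

#include <stddef.h>

#define RT_FANOUT_3		3

typedef struct RT_NODE_INNER_3_SKETCH
{
	unsigned char chunks[RT_FANOUT_3];
	/* the children array follows in the real node */
} RT_NODE_INNER_3_SKETCH;

static const struct
{
	int			fanout;
	size_t		inner_size;
} rt_size_class_info_sketch[] = {
	{
		.fanout = RT_FANOUT_3,
		.inner_size = sizeof(RT_NODE_INNER_3_SKETCH) + RT_FANOUT_3 * sizeof(void *),
	},
};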
> I don't think
> shrinking class-3 to class-1 makes sense.
Agreed. The smallest kind should just be freed when empty.
> > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
>
> Yes, but it also means that we use pointer-sized value anyway even if
> the value size is less than that, which wastes the memory, no?
At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:
psql -c "select * from bench_search_random_nodes(1*1000*1000)"
num_keys = 992660
(using a low enough number that the experimental change n125->n63 doesn't affect anything)
height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
v31:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
47800768 | 253 | 134
(unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
mem_allocated | load_ms | search_ms
---------------+---------+-----------
42926048 | 221 | 127
I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)
So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer 2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to maintain. That said, I've mostly left alone the "leaf" types and functions, as well as added some detritus like "const bool = false;". It would look a *lot* nicer if we gave up on multi-value leaves entirely, but there's no rush and I don't want to close that door entirely just yet.
> > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
>
> Sounds good.
The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint. That also has a curious side-effect for TID offsets: They are one-based so reserving the zero bit would actually simplify things: getting rid of the +1/-1 logic when converting bits to/from offsets.
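[Editor's note: a hedged sketch of what that slot layout could look like, purely illustrative. The lowest bit of the slot doubles as the "embedded" tag, and since TID offsets are one-based, bit N can directly represent offset N with no +1/-1 adjustment.]

#include <stdbool.h>
#include <stdint.h>

/* A slot either holds a pointer to a leaf (low bit clear, guaranteed by
 * allocator alignment) or an embedded offset bitmap with the low bit set. */

static inline bool
slot_is_embedded(uintptr_t slot)
{
	return (slot & 1) != 0;
}

static inline uintptr_t
slot_add_offset(uintptr_t slot, int off)	/* 1 <= off <= 63 on 64-bit */
{
	return slot | ((uintptr_t) 1 << off) | 1;
}

static inline bool
slot_test_offset(uintptr_t slot, int off)
{
	return slot_is_embedded(slot) && (slot & ((uintptr_t) 1 << off)) != 0;
}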
In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.
> > I've started in this patchset to refer to the node kinds as "4/16/48/256", regardless of their actual fanout.
> If we want to use the node kinds used in the paper, I think we should
> change the number in RT_NODE_KIND_X too.
Oh absolutely, this is nowhere near ready for cosmetic review :-)
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I just want to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications, while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration) and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ).
> > >
> >
> > Thank you for making progress on this. I agree with these directions
> > overall. I have some comments and questions:
>
> Glad to hear it and thanks for looking!
>
> > > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currently node32's two size classes work fine for growing, but the code should be simplified before extending to other cases.)
> >
> > Within the size class, we just alloc a new node of lower size class
> > and do memcpy(). I guess it will be almost same as what we do for
> > growing.
>
> Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
>
> .fanout = 3,
> .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
>
> ...may be better with a #defined symbol that can also be used elsewhere.

FWIW, exposing these definitions would be good in terms of testing too since we can use them in regression tests.

> > I don't think
> > shrinking class-3 to class-1 makes sense.
>
> Agreed. The smallest kind should just be freed when empty.
>
> > > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" based on the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level child slots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf', since zero is a valid value. Aside from that, this was a small change.
> >
> > Yes, but it also means that we use pointer-sized value anyway even if
> > the value size is less than that, which wastes the memory, no?
>
> At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, which may compensate:
>
> psql -c "select * from bench_search_random_nodes(1*1000*1000)"
> num_keys = 992660
>
> (using a low enough number that the experimental change n125->n63 doesn't affect anything)
> height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
>
> v31:
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 47800768 | 253 | 134
>
> (unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 42926048 | 221 | 127
>
> I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)

Interesting. The result would probably vary if we change the slab block sizes. I'd like to experiment if the code is available.

> So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer 2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another memory access. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to maintain.

Indeed.

> > > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested) because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extend this idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there, and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time info will require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in the node3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (no struct padding).
> >
> > Sounds good.
>
> The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint.

Do you mean we can make sure that the value doesn't set the lowest bit? Or is it an optimization for TIDStore?

> In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2 as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memory usage). That's worth considering.

Agreed. FWIW please let me know if there are some experiments I can help with.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jun 13, 2023 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embedded here) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additional size classes have both runtime costs as well as benefits. We need to have a balance.)
>
> Interesting. The result would probably vary if we change the slab
> block sizes. I'd like to experiment if the code is available.
I cleaned up a few things and attached v34 so you can do that if you like. (Note: what I said about node63/n125 not making a difference in that one test is not quite true since slab keeps a few empty blocks around. I did some rough mental math and I think it doesn't change the conclusion any.)
0001-0007 is basically v33, but can apply on master.
0008 just adds back RT_EXTEND_DOWN. I left it out to simplify moving to recursion.
> > Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'm thinking the hard-coded numbers like this:
> >
> > .fanout = 3,
> > .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
> >
> > ...may be better with a #defined symbol that can also be used elsewhere.
>
> FWIW, exposing these definitions would be good in terms of testing too
> since we can use them in regression tests.
I added some definitions in 0012. It kind of doesn't matter now what sizes the test uses unless it can also check that the nodes stay within the expected size, if that makes sense. It is helpful during debugging to force growth to stop at a certain size.
> > > Within the size class, we just alloc a new node of lower size class
> > > and do memcpy().
Not anymore. ;-) To be technical, it didn't "just" memcpy(), since it then fell through to find the insert position and memmove(). In some parts of Andres' prototype, no memmove() is necessary, because it memcpy()'s around the insert position, and puts the new child in the right place. I've done this in 0009.
The memcpy you mention was done for 1) simplicity 2) to avoid memset'ing. Well, it was never necessary to memset the whole node in the first place. Only the header, slot index array, and isset arrays need to be zeroed, so in 0011 we always do only that. That combines alloc and init functionality, and it's simple everywhere.
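[Editor's note: a hedged sketch of the "memcpy() around the insert position" idea described above, with illustrative names; the real node layout differs. When growing a node, the ranges on either side of the insert position are copied into the new node and the new child is dropped in between, so no memmove() over the destination is needed.]

#include <stdint.h>
#include <string.h>

static void
grow_and_insert(uint8_t *dst_chunks, void **dst_children,
				const uint8_t *src_chunks, void *const *src_children,
				int count, int insertpos, uint8_t new_chunk, void *new_child)
{
	/* elements before the insert position keep their index */
	memcpy(dst_chunks, src_chunks, insertpos * sizeof(uint8_t));
	memcpy(dst_children, src_children, insertpos * sizeof(void *));

	dst_chunks[insertpos] = new_chunk;
	dst_children[insertpos] = new_child;

	/* elements at and after the insert position shift right by one */
	memcpy(&dst_chunks[insertpos + 1], &src_chunks[insertpos],
		   (count - insertpos) * sizeof(uint8_t));
	memcpy(&dst_children[insertpos + 1], &src_children[insertpos],
		   (count - insertpos) * sizeof(void *));
}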
In 0010 I restored iteration functionality -- it can no longer get the shift from the node, because it's not there as of v33. I was not particularly impressed that there were no basic iteration tests, and in fact the test_pattern test relied on functioning iteration. I added some basic tests. I'm not entirely pleased with testing overall, but I think it's at least sufficient for the job. I had the idea to replace "shift" everywhere and use "level" as a fundamental concept. This is clearer. I do want to make sure the compiler can compute the shift efficiently where necessary. I think that can wait until much later.
0013 standardizes (mostly) on 4/16/48/256 for naming convention, regardless of actual size, as I started to do earlier.
0014 is part cleanup of shrinking, and part making grow-node-48 more consistent with the rest.
> > The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinking a much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" can use the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit for a tag "value or pointer-to-leaf" would have a much smaller code footprint.
>
> Do you mean we can make sure that the value doesn't set the lowest
> bit? Or is it an optimization for TIDStore?
It will be up to the caller (the user of the template) -- if an abbreviation is possible that fits in the upper 63 bits (with something to guard for 32-bit platforms), the developer will be able to specify a conversion function so that the caller only sees the full value when searching and setting. Without such a function, the template will fall back to the size of the value type to determine how the value is stored.
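[Editor's note: a purely hypothetical example of such a caller-supplied conversion, not from any patch. A small value type is abbreviated into the upper bits of a 64-bit word, comfortably inside 63 bits, and restored on lookup, leaving the lowest bit free for the tag.]

#include <stdint.h>

typedef struct my_value
{
	uint32_t	block;
	uint16_t	flags;
} my_value;

static inline uint64_t
my_value_abbrev(my_value v)
{
	return ((uint64_t) v.block << 16) | v.flags;	/* 48 bits used */
}

static inline my_value
my_value_restore(uint64_t abbrev)
{
	my_value	v = {(uint32_t) (abbrev >> 16), (uint16_t) (abbrev & 0xFFFF)};

	return v;
}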
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
I wrote:
> I cleaned up a few things and attached v34 so you can do that if you like.
Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
- v35-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
- v35-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
- v35-0004-Tool-for-measuring-radix-tree-performance.patch
- v35-0003-Add-radixtree-template.patch
- v35-0005-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
- v35-0006-Add-tidstore-tests-to-benchmark.patch
- v35-0007-Revert-building-benchmark-module-for-CI.patch
On Fri, Jun 23, 2023 at 6:54 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I wrote:
> > I cleaned up a few things and attached v34 so you can do that if you like.
>
> Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.
>

Thank you for updating the patch set. I'll look at the updates closely early next week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jun 27, 2023 at 5:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 23, 2023 at 6:54 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I wrote:
> > > I cleaned up a few things and attached v34 so you can do that if you like.
> >
> > Of course, "clean" is a relative term. While making a small bit of progress working in tidbitmap.c earlier this week, I thought it useful to prototype some things in the tidstore, at which point I was reminded it no longer compiles because of my recent work. I put in the necessary incantations so that the v32 tidstore compiles and passes tests, so here's a patchset for that (but no vacuum changes). I thought it was a good time to also condense it down to look more similar to previous patches, as a basis for future work.
> >
>
> Thank you for updating the patch set. I'll look at the updates closely
> early next week.
>

I've run several benchmarks against both v32 (from before your recent changes started) and the v35 patch. Overall the numbers are better than the previous version. Here is the test result where I used a 1-byte value: "select * from bench_load_random(10_000_000)"

* v35

radix tree leaves: 192 total in 0 blocks; 0 empty blocks; 0 free (0 chunks); 192 used
radix tree node 256: 13697472 total in 205 blocks; 0 empty blocks; 52400 free (25 chunks); 13645072 used
radix tree node 125: 86630592 total in 2115 blocks; 0 empty blocks; 7859376 free (6102 chunks); 78771216 used
radix tree node 32: 94912 total in 0 blocks; 10 empty blocks; 0 free (0 chunks); 94912 used
radix tree node 15: 9269952 total in 1136 blocks; 0 empty blocks; 168 free (1 chunks); 9269784 used
radix tree node 3: 1915502784 total in 233826 blocks; 0 empty blocks; 6560 free (164 chunks); 1915496224 used

 mem_allocated | load_ms
---------------+---------
    2025194752 |    3011
(1 row)

* v32

radix tree node 256: 192 total in 0 blocks; 0 empty blocks; 0 free (0 chunks); 192 used
radix tree node 256: 13487552 total in 205 blocks; 0 empty blocks; 51600 free (25 chunks); 13435952 used
radix tree node 125: 192 total in 0 blocks; 0 empty blocks; 0 free (0 chunks); 192 used
radix tree node 125: 86630592 total in 2115 blocks; 0 empty blocks; 7859376 free (6102 chunks); 78771216 used
radix tree node 32: 192 total in 0 blocks; 0 empty blocks; 0 free (0 chunks); 192 used
radix tree node 32: 94912 total in 0 blocks; 10 empty blocks; 0 free (0 chunks); 94912 used
radix tree node 15: 192 total in 0 blocks; 0 empty blocks; 0 free (0 chunks); 192 used
radix tree node 15: 9269952 total in 1136 blocks; 0 empty blocks; 168 free (1 chunks); 9269784 used
radix tree node 3: 241597002 total in 29499 blocks; 0 empty blocks; 3864 free (161 chunks); 241593138 used
radix tree node 3: 1809039552 total in 221696 blocks; 0 empty blocks; 5280 free (110 chunks); 1809034272 used

 mem_allocated | load_ms
---------------+---------
    2160118410 |    3069
(1 row)

As you mentioned, the 1-byte value is embedded into 8 bytes, so 7 bytes are unused, but we use less memory since we use fewer slab contexts and reduce fragmentation. I've also tested some large-value cases (e.g. an 80-byte value) and got a similar result.

Regarding the code, there are many todo and fixme comments, so it seems to me that your recent work is still in progress. What is the current status? Can I start reviewing the code or should I wait for a while until your recent work completes?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> As you mentioned, the 1-byte value is embedded into 8 byte so 7 bytes
> are unused, but we use less memory since we use less slab contexts and
> save fragmentations.
Thanks for testing. This tree is sparse enough that most of the space is taken up by small inner nodes, and not by leaves. So, it's encouraging to see a small space savings even here.
> I've also tested some large value cases (e.g. the value is 80-bytes)
> and got a similar result.
Interesting. With a separate allocation per value the overhead would be 8 bytes, or 10% here. It's plausible that savings elsewhere can hide that, globally.
> Regarding the codes, there are many todo and fixme comments so it
> seems to me that your recent work is still in-progress. What is the
> current status? Can I start reviewing the code or should I wait for a
> while until your recent work completes?
Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches laying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > As you mentioned, the 1-byte value is embedded into 8 byte so 7 bytes
> > are unused, but we use less memory since we use less slab contexts and
> > save fragmentations.
>
> Thanks for testing. This tree is sparse enough that most of the space is taken up by small inner nodes, and not by leaves. So, it's encouraging to see a small space savings even here.
>
> > I've also tested some large value cases (e.g. the value is 80-bytes)
> > and got a similar result.
>
> Interesting. With a separate allocation per value the overhead would be 8 bytes, or 10% here. It's plausible that savings elsewhere can hide that, globally.
>
> > Regarding the codes, there are many todo and fixme comments so it
> > seems to me that your recent work is still in-progress. What is the
> > current status? Can I start reviewing the code or should I wait for a
> > while until your recent work completes?
>
> Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches laying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?

Yes, I can experiment with these patches in the meantime.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches laying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
>
> Yes, I can experiment with these patches in the meantime.
Okay, here it is in v36. 0001-6 are same as v35.
0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
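[Editor's note: a hedged sketch of the kind of trick 0009 plays, illustrative only; it assumes the chunk array is padded so a fixed-width copy cannot run past the node. Shifting the 1-byte chunks of a fanout-4 node right by one can then be a constant 4-byte load and store, which the compiler emits as single instructions, instead of a variable-length memmove().]

#include <stdint.h>
#include <string.h>

/* chunks[] must have at least insertpos + 5 bytes of storage so the
 * fixed-width copy below stays inside the allocation; bytes past the
 * node's current count are unused and may be overwritten freely. */
static inline void
shift_chunks_right(uint8_t *chunks, int insertpos)
{
	uint32_t	w;

	memcpy(&w, &chunks[insertpos], sizeof(w));
	memcpy(&chunks[insertpos + 1], &w, sizeof(w));
}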
0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4 only?
If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Sat, Jul 8, 2023 at 11:54 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches laying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
> >
> > Yes, I can experiment with these patches in the meantime.
>
> Okay, here it is in v36. 0001-6 are same as v35.
>
> 0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
> 0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
> 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
> 0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4 only?
>
> If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

Thanks for sharing the patches!

0007, 0008, 0010, and 0011 are straightforward, and I agree with merging them. I have some questions on the 0009 patch:

+ /* shift chunks and children
+
+ Unfortunately, gcc has gotten too aggressive in turning simple loops
+ into slow memmove's, so we have to be a bit more clever.
+ See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481
+
+ We take advantage of the fact that a good
+ compiler can turn a memmove of a small constant power-of-two
+ number of bytes into a single load/store.
+ */

According to the comment, is this optimization only for gcc? And is there no negative impact when building with other compilers such as clang? I'm not sure that it's a good approach to hand-optimize the code this much to generate better instructions on gcc; I think this change reduces readability and maintainability. Also, according to the bugzilla ticket referred to in the comment, this is recognized as a bug, so once the gcc bug is fixed we might no longer need this trick, no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,

On Thu, Jul 13, 2023 at 5:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jul 8, 2023 at 11:54 AM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jul 5, 2023 at 8:21 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > Well, it's going to be a bit of a mess until I can demonstrate it working (and working well) with bitmap heap scan. Fixing that now is just going to create conflicts. I do have a couple small older patches laying around that were quick experiments -- I think at least some of them should give a performance boost in loading speed, but haven't had time to test. Would you like to take a look?
> > >
> > > Yes, I can experiment with these patches in the meantime.
> >
> > Okay, here it is in v36. 0001-6 are same as v35.
> >
> > 0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.
> > 0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.
> > 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.
> > 0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4 only?
> >
> > If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

cfbot reported some failures[1], and the v36 patch cannot be applied cleanly to the current HEAD. I've attached updated patches to make cfbot happy.

Regards,

[1] http://cfbot.cputube.org/highlights/all.html#3687

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> 0007, 0008, 0010, and 0011 are straightforward and I agree to merge them.
[Part 1 - clear the deck of earlier performance work etc]
Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that.
==== sparse loads
v38 0001-0006 (still using node3 for this test only):
select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
avg
---------------------
27.1000000000000000
select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
avg
----------------------
165.6333333333333333
v38-0007-Optimize-RT_EXTEND_DOWN.patch
select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
avg
---------------------
25.0900000000000000
select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
avg
----------------------
157.3666666666666667
That seems worth doing.
v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized.
> I have some questions on 0009 patch:
> According to the comment, is this optimization only for gcc?
No, not at all. That tells me the comment is misleading.
> I think this change reduces
> readability and maintainability.
Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though...
==== load mostly node4
select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
mem_allocated | load_ms | search_ms
---------------+---------+-----------
7352384 | 25 | 0
v38-0009-TEMP-take-out-search-time-from-bench.patch
This is just to allow LATERAL queries for better measurements.
select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;
avg
---------------------
24.8333333333333333
v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot)
avg
---------------------
21.9600000000000000
So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two.
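For readers following along, here is a minimal sketch of the loop shape being discussed, assuming a node4-style chunk array. It only illustrates the gcc loop-to-memmove issue and the __asm("") workaround mentioned above; it is not the contents of v38-0010 or any other posted patch.

#include "postgres.h"

/*
 * Illustration only: make room for a new chunk in a small sorted array.
 * gcc can recognize the plain backward-shift loop as a memmove() idiom
 * (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481) and emit a library
 * call, which is comparatively slow for arrays of only a few elements.
 * The empty asm statement is the workaround mentioned above: it hides the
 * idiom from the optimizer so the loop stays a loop.
 */
static inline void
shift_chunks_for_insert(uint8 *chunks, int count, int insertpos)
{
	for (int i = count - 1; i >= insertpos; i--)
	{
#ifdef __GNUC__
		__asm__("");			/* defeat conversion to memmove() */
#endif
		chunks[i + 1] = chunks[i];
	}
	/* the caller then stores the new chunk at chunks[insertpos] */
}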
v38-0011-Optimize-add_child_4-take-2.patch
avg
---------------------
21.3500000000000000
This is possibly faster than v38-0010, but it looks like it's not worth the complexity, assuming the other way avoids the bug going forward.
> According to the bugzilla ticket
> referred to in the comment, it's recognized as a bug in the community,
> so once the gcc bug is fixed, we might no longer need this trick, no?
No comment in two years...
v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress:
avg
---------------------
22.0900000000000000
v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
This patch uses a single loop for the copy.
avg
---------------------
21.0300000000000000
Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.
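As a reference for what "branch-free" means here, the following is a sketch under the assumption that the copy skips over the slot reserved for the new element; it is not necessarily the exact code in v38-0012.

#include "postgres.h"

/*
 * Sketch only: copy "count" existing chunks from src to dst while leaving a
 * hole at insertpos for the new element.  The comparison evaluates to 0 or
 * 1, so elements at or after insertpos land one slot to the right without
 * any branch inside the loop body.
 */
static inline void
copy_chunks_for_insert(uint8 *dst, const uint8 *src, int count, int insertpos)
{
	for (int i = 0; i < count; i++)
		dst[i + (i >= insertpos)] = src[i];
	/* the caller then stores the new chunk at dst[insertpos] */
}

The same indexing trick would presumably work for the parallel children array, which is why a single loop can cover the whole copy.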
v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
Just small cleanups.
v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
Makes sense, but untested.
===============
[Part 2]
Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind...
v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch
This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type.
Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)
I'm also concerned about the number of places that have to know if the store is using shared memory or not. Something to think about later.
v38-0018-Consolidate-inserting-updating-values.patch
This is something I coded up to get to an API more similar to one in simplehash, as used in tidbitmap.c. It seems worth doing on its own to reduce code duplication, and it also simplifies coding of varlen types and "runtime-embeddable values".
Attachment
Hi,

On Mon, Aug 14, 2023 at 8:05 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > 0007, 0008, 0010, and 0011 are straightforward and I agree to merge them.

Thank you for updating the patch!

> [Part 1 - clear the deck of earlier performance work etc]
>
> Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that.
>
> ==== sparse loads
>
> v38 0001-0006 (still using node3 for this test only):
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
> avg
> ---------------------
> 27.1000000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
> avg
> ----------------------
> 165.6333333333333333
>
> v38-0007-Optimize-RT_EXTEND_DOWN.patch
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a;
> avg
> ---------------------
> 25.0900000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;
> avg
> ----------------------
> 157.3666666666666667
>
> That seems worth doing.
>
> v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
>
> This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized.
>
> > I have some questions on 0009 patch:
>
> > According to the comment, is this optimization only for gcc?
>
> No, not at all. That tells me the comment is misleading.
>
> > I think this change reduces
> > readability and maintainability.
>
> Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though...
>
> ==== load mostly node4
>
> select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
> n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
> mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
> 7352384 | 25 | 0
>
> v38-0009-TEMP-take-out-search-time-from-bench.patch
>
> This is just to allow LATERAL queries for better measurements.
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;
>
> avg
> ---------------------
> 24.8333333333333333

0007, 0008, and 0009 look good to me.

> v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
>
> This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot)
>
> avg
> ---------------------
> 21.9600000000000000
>
> So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two.
>
> v38-0011-Optimize-add_child_4-take-2.patch
> avg
> ---------------------
> 21.3500000000000000
>
> This is possibly faster than v38-0010, but it looks like it's not worth the complexity, assuming the other way avoids the bug going forward.

I prefer 0010 but is it worth testing with other compilers such as clang?

> > According to the bugzilla ticket
> > referred to in the comment, it's recognized as a bug in the community,
> > so once the gcc bug is fixed, we might no longer need this trick, no?
>
> No comment in two years...
>
> v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
>
> This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress:
>
> avg
> ---------------------
> 22.0900000000000000
>
> v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
>
> This patch uses a single loop for the copy.
>
> avg
> ---------------------
> 21.0300000000000000
>
> Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.

Agreed.

> v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
> v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
>
> Just small cleanups.
>
> v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
>
> Makes sense, but untested.

Agreed. BTW cfbot reported that some regression tests failed due to OOM. I've attached the patch to fix it.

> ===============
> [Part 2]
>
> Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind...
>
> v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch
>
> This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type.

That's more readable.

> Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)

It would not be hard to have such SQL functions. I'll try it.

> I'm also concerned about the number of places that have to know if the store is using shared memory or not. Something to think about later.
>
> v38-0018-Consolidate-inserting-updating-values.patch
>
> This is something I coded up to get to an API more similar to one in simplehash, as used in tidbitmap.c. It seems worth doing on its own to reduce code duplication, and it also simplifies coding of varlen types and "runtime-embeddable values".

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> BTW cfbot reported that some regression tests failed due to OOM. I've
> attached the patch to fix it.
Seems worth doing now rather than later, so added this and squashed most of the rest together. I wonder if that test uses too much memory in general. Maybe using the full uint64 is too much.
> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > This is possibly faster than v38-0010, but it looks like it's not worth the complexity, assuming the other way avoids the bug going forward.
>
> I prefer 0010 but is it worth testing with other compilers such as clang?
Okay, keeping 0010 with a comment, and leaving out 0011 for now. Clang is aggressive about unrolling loops, so may be worth looking globally at some point.
> > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
> > Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays.
>
> Agreed.
Keeping 0012 and not 0013.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Tue, Aug 15, 2023 at 6:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > BTW cfbot reported that some regression tests failed due to OOM. I've
> > attached the patch to fix it.
>
> Seems worth doing now rather than later, so added this and squashed most of the rest together.
This segfaults because of a mistake fixing a rebase conflict, so v40 attached.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 16, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Tue, Aug 15, 2023 at 6:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > BTW cfbot reported that some regression tests failed due to OOM. I've
> > > attached the patch to fix it.
> >
> > Seems worth doing now rather than later, so added this and squashed most of the rest together.
>
> This segfaults because of a mistake fixing a rebase conflict, so v40 attached.

Thank you for updating the patch set.

On Tue, Aug 15, 2023 at 11:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output when something fails. (I wonder how hard it would be to have SQL functions that add blocks/offsets to the tid store, and emit tuples of tids found in the store.)
>
> It would not be hard to have such SQL functions. I'll try it.

I've updated the regression tests for tidstore so that it uses SQL functions to add blocks/offsets and dump its contents. The new test covers the same test coverage but it's executed using SQL functions instead of executing all tests in one SQL function.

The 0008 patch fixes a bug in tidstore that I found during this work. We didn't recreate the radix tree in the same memory context when TidStoreReset().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've updated the regression tests for tidstore so that it uses SQL
> functions to add blocks/offsets and dump its contents. The new test
> covers the same test coverage but it's executed using SQL functions
> instead of executing all tests in one SQL function.
This is much nicer and more flexible, thanks! A few questions/comments:
tidstore_dump_tids() returns a string -- is it difficult to turn this into a SRF, or is it just a bit more work?
The lookup test seems fine for now. The output would look nicer with an "order by tid".
I think we could have the SQL function tidstore_create() take a boolean for shared memory. That would allow ad-hoc testing without a recompile, if I'm not mistaken.
+SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+ FROM blocks, offsets
+ GROUP BY blk;
+ tidstore_set_block_offsets
+----------------------------
+
+
+
+
+
+(5 rows)
Calling a void function multiple times leads to vertical whitespace, which looks a bit strange and may look better with some output, even if irrelevant:
-SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
row_number | tidstore_set_block_offsets
------------+----------------------------
1 |
2 |
3 |
4 |
5 |
(5 rows)
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Aug 28, 2023 at 4:20 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've updated the regression tests for tidstore so that it uses SQL
> > functions to add blocks/offsets and dump its contents. The new test
> > covers the same test coverage but it's executed using SQL functions
> > instead of executing all tests in one SQL function.
>
> This is much nicer and more flexible, thanks! A few questions/comments:
>
> tidstore_dump_tids() returns a string -- is it difficult to turn this into a SRF, or is it just a bit more work?

It's not difficult. I've changed it in the v42 patch.

> The lookup test seems fine for now. The output would look nicer with an "order by tid".

Agreed.

> I think we could have the SQL function tidstore_create() take a boolean for shared memory. That would allow ad-hoc testing without a recompile, if I'm not mistaken.

Agreed.

> +SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +  FROM blocks, offsets
> +  GROUP BY blk;
> + tidstore_set_block_offsets
> +----------------------------
> +
> +
> +
> +
> +
> +(5 rows)
>
> Calling a void function multiple times leads to vertical whitespace, which looks a bit strange and may look better with some output, even if irrelevant:
>
> -SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
>
> row_number | tidstore_set_block_offsets
> ------------+----------------------------
> 1 |
> 2 |
> 3 |
> 4 |
> 5 |
> (5 rows)

Yes, it looks better.

I've attached the v42 patch set. I improved the tidstore regression test code, in addition to incorporating the above comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached the v42 patch set. I improved the tidstore regression test code,
> in addition to incorporating the above comments.
Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 6, 2023 at 3:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached the v42 patch set. I improved the tidstore regression test code,
> > in addition to incorporating the above comments.
>
> Seems fine at a glance, thanks. I will build on this to implement variable-length values.

Thanks.

> I have already finished one prerequisite which is: public APIs passing pointers to values.

Great!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,

On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> I've attached the v42 patch set. I improved the tidstore regression test code,
> in addition to incorporating the above comments.

Why did you need to disable the benchmark module for CI?

Greetings,

Andres Freund
On Sat, Sep 16, 2023 at 9:03 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> > I've attached the v42 patch set. I improved the tidstore regression test code,
> > in addition to incorporating the above comments.
>
> Why did you need to disable the benchmark module for CI?

I didn't want to unnecessarily make cfbot unhappy, since the benchmark module is not going to get committed to core and is sometimes not up to date.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I wrote:

> Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.

Since my publishing schedule has not kept up, I'm just going to share something similar to what I mentioned earlier, just to get things moving again.

0001-0009 are from earlier versions, except for 0007 which makes a bunch of superficial naming updates, similar to those done in a recent other version. Somewhere along the way I fixed long-standing git whitespace warnings, but I don't remember if that's new here. In any case, let's try to preserve that.

0010 is some minor refactoring to reduce duplication.

0011-0014 add public functions that give the caller more control over the input and responsibility for locking. They are not named well, but I plan these to be temporary: they are currently used for the tidstore only, since that has much simpler tests than the standard radix tree tests. One thing to note: since the tidstore has always done its own locking within a larger structure, these patches don't bother to do locking at the radix tree level. Locking twice seems...not great. These patches are the main prerequisite for variable-length values. Once that is working well, we can switch the standard tests to the new APIs.

Next steps include (some of these were briefly discussed off-list with Sawada-san):

- template parameter for varlen values
- some callers to pass length in bytes
- block entries to have num_elems for # of bitmap words
- a way for updates to re-alloc values when needed
- aset allocation for values when appropriate
Attachment
On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I wrote:
>
> > Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values.
>
> Since my publishing schedule has not kept up, I'm just going to share
> something similar to what I mentioned earlier, just to get things
> moving again.

Thanks for sharing the updates. I've returned to work today and will resume working on this feature.

> 0001-0009 are from earlier versions, except for 0007 which makes a
> bunch of superficial naming updates, similar to those done in a recent
> other version. Somewhere along the way I fixed long-standing git
> whitespace warnings, but I don't remember if that's new here. In any
> case, let's try to preserve that.
>
> 0010 is some minor refactoring to reduce duplication.
>
> 0011-0014 add public functions that give the caller more control over
> the input and responsibility for locking. They are not named well, but
> I plan these to be temporary: they are currently used for the tidstore
> only, since that has much simpler tests than the standard radix tree
> tests. One thing to note: since the tidstore has always done its own
> locking within a larger structure, these patches don't bother to do
> locking at the radix tree level. Locking twice seems...not great.
> These patches are the main prerequisite for variable-length values.
> Once that is working well, we can switch the standard tests to the new
> APIs.

Since variable-length value support is a big deal and would be related to the API design, I'd like to discuss the API design first. Currently, we have the following APIs:

---
RT_VALUE_TYPE
RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
or for variable-length value support,
RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If an entry already exists, return its pointer and set "found" to true. Otherwise, insert an empty value with sz bytes, return its pointer, and set "found" to false.

---
RT_VALUE_TYPE
RT_FIND(RT_RADIX_TREE *tree, uint64 key);

If an entry exists, return the pointer to the value, otherwise return NULL. (I omitted RT_SEARCH() as it's essentially the same as RT_FIND() and will probably get removed.)

---
bool
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
or for variable-length value support,
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);

If an entry already exists, update its value to 'value_p' and return true. Otherwise set the value and return false.
---

Given variable-length value support, RT_GET() would have to do repalloc() if the existing value size is not big enough for the new value, but it cannot, as the radix tree doesn't know the size of each stored value. Another idea is that the radix tree returns the pointer to the slot and the caller updates the value accordingly. But it means that the caller has to update the slot properly while considering the value size (embedded vs. single-leaf value), which seems not a good idea.

To deal with this problem, I think we can somewhat change the RT_GET() API as follows:

RT_VALUE_TYPE
RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If the entry already exists, replace the value with a new empty value with sz bytes and set "found" to true. Otherwise, insert an empty value, return its pointer, and set "found" to false.

We probably will find a better name but I use RT_INSERT() for discussion. RT_INSERT() returns an empty slot regardless of existing values. It can be used to insert a new value or to replace the value with a larger value.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Since variable-length value support is a big deal and would be
> related to the API design, I'd like to discuss the API design first.

Thanks for the fine summary of the issues here.

[Swapping this back in my head]

> RT_VALUE_TYPE
> RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> or for variable-length value support,
> RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If an entry already exists, return its pointer and set "found" to
> true. Otherwise, insert an empty value with sz bytes, return its
> pointer, and set "found" to false.

> ---
> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

I'd have to double-check, but I think RT_SET is vestigial and I'm not sure it has any advantage over RT_GET as I've sketched it out. I'm pretty sure it's only there now because changing the radix tree regression tests is much harder than changing TID store.

> Given variable-length value support, RT_GET() would have to do
> repalloc() if the existing value size is not big enough for the new
> value, but it cannot, as the radix tree doesn't know the size of each
> stored value.

I think we have two choices:

- the value stores the "length". The caller would need to specify a function to compute size from the "length" member. Note this assumes there is an array. I think both aspects are not great.
- the value stores the "size". Callers that store an array (as PageTableEntry's do) would compute length when they need to. This sounds easier.

> Another idea is that the radix tree returns the pointer
> to the slot and the caller updates the value accordingly.

I did exactly this in v43 TidStore if I understood you correctly. If I misunderstood you, can you clarify?

> But it means
> that the caller has to update the slot properly while considering the
> value size (embedded vs. single-leaf value), which seems not a good
> idea.

For this optimization, callers will have to know about pointer-sized values and treat them differently, but they don't need to know the details about how and where they are stored.

While we want to keep embedded values in the back of our minds, I really think the details should be postponed to a follow-up commit.

> To deal with this problem, I think we can somewhat change the RT_GET() API
> as follows:
>
> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

For the case we are discussing, bitmaps, updating an existing value is a bit tricky. We need the existing value to properly update it with set or unset bits. This can't work in general without a lot of work for the caller.

However, for vacuum, we have all values that we need up front. That gives me an idea: Something like this insert API could be optimized for "insert-only": If we only free values when we free the whole tree at the end, that's a clear use case for David Rowley's proposed "bump context", which would save 8 bytes per allocation and be a bit faster. [1] (RT_GET for varlen values would use an aset context, to allow repalloc, and nodes would continue to use slab).

[1] https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Since variable-length value support is a big deal and would be
> > related to the API design, I'd like to discuss the API design first.
>
> Thanks for the fine summary of the issues here.
>
> [Swapping this back in my head]
>
> > RT_VALUE_TYPE
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> > or for variable-length value support,
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If an entry already exists, return its pointer and set "found" to
> > true. Otherwise, insert an empty value with sz bytes, return its
> > pointer, and set "found" to false.
>
> > ---
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> I'd have to double-check, but I think RT_SET is vestigial and I'm not
> sure it has any advantage over RT_GET as I've sketched it out. I'm
> pretty sure it's only there now because changing the radix tree
> regression tests is much harder than changing TID store.

Agreed.

> > Given variable-length value support, RT_GET() would have to do
> > repalloc() if the existing value size is not big enough for the new
> > value, but it cannot, as the radix tree doesn't know the size of each
> > stored value.
>
> I think we have two choices:
>
> - the value stores the "length". The caller would need to specify a
> function to compute size from the "length" member. Note this assumes
> there is an array. I think both aspects are not great.
> - the value stores the "size". Callers that store an array (as
> PageTableEntry's do) would compute length when they need to. This
> sounds easier.

As for the second idea, do we always need to require the value to have the "size" (e.g. int32) in the first field of its struct? If so, the caller will be able to use only 4 bytes in embedded value cases (or won't be able to use it at all if the pointer size is 4 bytes).

> > Another idea is that the radix tree returns the pointer
> > to the slot and the caller updates the value accordingly.
>
> I did exactly this in v43 TidStore if I understood you correctly. If I
> misunderstood you, can you clarify?

I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates the value as they want.

> > But it means
> > that the caller has to update the slot properly while considering the
> > value size (embedded vs. single-leaf value), which seems not a good
> > idea.
>
> For this optimization, callers will have to know about pointer-sized
> values and treat them differently, but they don't need to know the
> details about how and where they are stored.
>
> While we want to keep embedded values in the back of our minds, I
> really think the details should be postponed to a follow-up commit.

Agreed.

> > To deal with this problem, I think we can somewhat change the RT_GET() API
> > as follows:
> >
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> For the case we are discussing, bitmaps, updating an existing value is
> a bit tricky. We need the existing value to properly update it with
> set or unset bits. This can't work in general without a lot of work
> for the caller.

True.

> However, for vacuum, we have all values that we need up front. That
> gives me an idea: Something like this insert API could be optimized
> for "insert-only": If we only free values when we free the whole tree
> at the end, that's a clear use case for David Rowley's proposed "bump
> context", which would save 8 bytes per allocation and be a bit faster.
> [1] (RT_GET for varlen values would use an aset context, to allow
> repalloc, and nodes would continue to use slab).

Interesting idea and worth trying. Do we need to protect the whole tree as insert-only for safety? It's problematic if the user mixes RT_INSERT() and RT_GET().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > > Given variable-length value support, RT_GET() would have to do
> > > repalloc() if the existing value size is not big enough for the new
> > > value, but it cannot, as the radix tree doesn't know the size of each
> > > stored value.
> >
> > I think we have two choices:
> >
> > - the value stores the "length". The caller would need to specify a
> > function to compute size from the "length" member. Note this assumes
> > there is an array. I think both aspects are not great.
> > - the value stores the "size". Callers that store an array (as
> > PageTableEntry's do) would compute length when they need to. This
> > sounds easier.
>
> As for the second idea, do we always need to require the value to have
> the "size" (e.g. int32) in the first field of its struct? If so, the
> caller will be able to use only 4 bytes in embedded value cases (or
> won't be able to use it at all if the pointer size is 4 bytes).

We could have an RT_SIZE_TYPE for varlen value types. That's easy. There is another way, though: (This is a digression into embedded values, but it does illuminate some issues even aside from that).

My thinking a while ago was that an embedded value had no explicit length/size, but could be "expanded" into a conventional value for the caller. For bitmaps, the smallest full value would have length 1 and whatever size (for tid store maybe 16 bytes). This would happen automatically via a template function.

Now I think that could be too complicated (especially for page table entries, which have more bookkeeping than vacuum needs) and slow. Imagine this as an embedded value:

typedef struct BlocktableEntry
{
    uint16      size;

    /* later: uint8 flags; for bitmap scan */

    /* 64 bit: 3 elements , 32-bit: 1 element */
    OffsetNumber offsets[(sizeof(Pointer) - sizeof(int16)) / sizeof(OffsetNumber)];

    /* end of embeddable value */

    bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

Here we can use a slot to store up to 3 offsets, no matter how big they are. That's great because a bitmap could be mostly wasted space.

But now the caller can't know up front how many bytes it needs until it retrieves the value and sees what's already there. If there are already three values, the caller needs to tell the tree "alloc this much, update this slot you just gave me with the alloc (maybe DSA) pointer, and return the local pointer". Then copy the 3 offsets into set bits, and set whatever else it needs to. With normal values, same thing, but with realloc.

This is a bit complex, but I see an advantage: the tree doesn't need to care so much about the size, so the value doesn't need to contain the size. For our case, we can use length (number of bitmapwords) without the disadvantages I mentioned above, with length zero (or maybe -1) meaning "no bitmapword array, the offsets are all in this small array".

> > > Another idea is that the radix tree returns the pointer
> > > to the slot and the caller updates the value accordingly.
> >
> > I did exactly this in v43 TidStore if I understood you correctly. If I
> > misunderstood you, can you clarify?
>
> I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
> the value as they want.

Did my sketch above get closer to that? Side note: I don't think we can expose that directly (e.g. need to check for create or extend upwards), but some functionality can be a thin wrapper around it.

> > However, for vacuum, we have all values that we need up front. That
> > gives me an idea: Something like this insert API could be optimized
> > for "insert-only": If we only free values when we free the whole tree
> > at the end, that's a clear use case for David Rowley's proposed "bump
> > context", which would save 8 bytes per allocation and be a bit faster.
> > [1] (RT_GET for varlen values would use an aset context, to allow
> > repalloc, and nodes would continue to use slab).
>
> Interesting idea and worth trying. Do we need to protect the whole
> tree as insert-only for safety? It's problematic if the user mixes
> RT_INSERT() and RT_GET().

You're right, but I'm not sure what the policy should be.
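To make the "length zero means the offsets live in the small embedded array" idea concrete, here is a sketch of a membership test over the BlocktableEntry layout quoted above. It is illustration only, not from any posted patch: the nwords parameter stands in for however the word count ends up being stored, unused embedded slots are assumed to hold InvalidOffsetNumber, and the bit numbering simply wastes bit zero.

#include "postgres.h"
#include "nodes/bitmapset.h"	/* bitmapword, BITS_PER_BITMAPWORD */
#include "storage/off.h"		/* OffsetNumber */

static bool
entry_contains_offset(BlocktableEntry *entry, int nwords, OffsetNumber off)
{
	if (nwords == 0)
	{
		/* offsets are stored directly in the small embedded array */
		for (int i = 0; i < (int) lengthof(entry->offsets); i++)
		{
			if (entry->offsets[i] == off)
				return true;
		}
		return false;
	}

	/* otherwise it's a bitmap, as in tidbitmap's PagetableEntry */
	if (off / BITS_PER_BITMAPWORD >= nwords)
		return false;

	return (entry->words[off / BITS_PER_BITMAPWORD] &
			((bitmapword) 1 << (off % BITS_PER_BITMAPWORD))) != 0;
}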
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:

> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

Looking at TidStoreSetBlockOffsets again (in particular how it works with RT_GET), and thinking about issues we've discussed, I think RT_SET is sufficient for vacuum. Here's how it could work:

TidStoreSetBlockOffsets could have a stack variable that's "almost always" large enough. When not, it can allocate in its own context. It sets the necessary bits there. Then, it passes the pointer to RT_SET with the number of bytes to copy. That seems very simple.

At some future time, we can add a new function with the complex business about getting the current value to modify it, with the re-alloc'ing that it might require.

In other words, from both an API perspective and a performance perspective, it makes sense for tid store to have a simple "set" interface for vacuum that can be optimized for its characteristics (insert only, ordered offsets), and also a more complex one for bitmap scan (setting/unsetting bits of existing values, in any order). They can share the same iteration interface, key types, and value types.

What do you think, Masahiko?
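A hedged sketch of that flow follows. TidStore, ts->tree, blocktable_entry_size(), and the buffer size are placeholders assumed only for illustration; RT_SET() is the variable-length variant quoted above, and this is not code from any posted patch.

/*
 * Sketch only: build the value in a stack buffer that is "almost always"
 * big enough, fall back to a palloc'd buffer otherwise, then hand the
 * finished value and its size in bytes to RT_SET().  Alignment details are
 * ignored for brevity.  Assumes the usual PostgreSQL headers (postgres.h).
 */
static void
tidstore_set_block_offsets_sketch(TidStore *ts, BlockNumber blkno,
								  const OffsetNumber *offsets, int num_offsets)
{
	char		stackbuf[256];	/* "almost always" large enough */
	char	   *buf = stackbuf;
	BlocktableEntry *entry;
	size_t		sz;

	sz = blocktable_entry_size(num_offsets);	/* hypothetical size helper */
	if (sz > sizeof(stackbuf))
		buf = palloc(sz);

	entry = (BlocktableEntry *) buf;
	/* ... set the bit (or embedded offset) for each element of offsets ... */

	RT_SET(ts->tree, (uint64) blkno, entry, sz);

	if (buf != stackbuf)
		pfree(buf);
}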
On Wed, Dec 6, 2023 at 3:39 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 4, 2023 at 5:21 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > > > Given variable-length value support, RT_GET() would have to do
> > > > repalloc() if the existing value size is not big enough for the new
> > > > value, but it cannot, as the radix tree doesn't know the size of each
> > > > stored value.
> > >
> > > I think we have two choices:
> > >
> > > - the value stores the "length". The caller would need to specify a
> > > function to compute size from the "length" member. Note this assumes
> > > there is an array. I think both aspects are not great.
> > > - the value stores the "size". Callers that store an array (as
> > > PageTableEntry's do) would compute length when they need to. This
> > > sounds easier.
> >
> > As for the second idea, do we always need to require the value to have
> > the "size" (e.g. int32) in the first field of its struct? If so, the
> > caller will be able to use only 4 bytes in embedded value cases (or
> > won't be able to use it at all if the pointer size is 4 bytes).
>
> We could have an RT_SIZE_TYPE for varlen value types. That's easy.
> There is another way, though: (This is a digression into embedded
> values, but it does illuminate some issues even aside from that).
>
> My thinking a while ago was that an embedded value had no explicit
> length/size, but could be "expanded" into a conventional value for the
> caller. For bitmaps, the smallest full value would have length 1 and
> whatever size (for tid store maybe 16 bytes). This would happen
> automatically via a template function.
>
> Now I think that could be too complicated (especially for page table
> entries, which have more bookkeeping than vacuum needs) and slow.
> Imagine this as an embedded value:
>
> typedef struct BlocktableEntry
> {
>     uint16      size;
>
>     /* later: uint8 flags; for bitmap scan */
>
>     /* 64 bit: 3 elements , 32-bit: 1 element */
>     OffsetNumber offsets[(sizeof(Pointer) - sizeof(int16)) / sizeof(OffsetNumber)];
>
>     /* end of embeddable value */
>
>     bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
> } BlocktableEntry;
>
> Here we can use a slot to store up to 3 offsets, no matter how big
> they are. That's great because a bitmap could be mostly wasted space.

Interesting idea.

> But now the caller can't know up front how many bytes it needs until
> it retrieves the value and sees what's already there. If there are
> already three values, the caller needs to tell the tree "alloc this
> much, update this slot you just gave me with the alloc (maybe DSA)
> pointer, and return the local pointer". Then copy the 3 offsets into
> set bits, and set whatever else it needs to. With normal values, same
> thing, but with realloc.
>
> This is a bit complex, but I see an advantage: the tree doesn't need to
> care so much about the size, so the value doesn't need to contain the
> size. For our case, we can use length (number of bitmapwords) without
> the disadvantages I mentioned above, with length zero (or maybe -1)
> meaning "no bitmapword array, the offsets are all in this small
> array".

It's still unclear to me why the value doesn't need to contain the size.

If I understand you correctly, in RT_GET(), the tree allocs new memory and updates the slot where the value is embedded with the pointer to the allocated memory, and returns the pointer to the caller. Since the returned value, the newly allocated memory, is still empty, the caller needs to copy the contents of the old value to the new value and do whatever else it needs to.

If the value is already a single-leaf value and RT_GET() is called with a larger size, is the slot always replaced with the newly allocated area, with the caller needing to copy the contents? If the tree does realloc the value with a new size, how does the tree know the new value is larger than the existing value? It seems like the caller needs to provide a function to calculate the size of the value based on the length.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 7, 2023 at 12:27 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sat, Oct 28, 2023 at 5:56 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> Looking at TidStoreSetBlockOffsets again (in particular how it works
> with RT_GET), and thinking about issues we've discussed, I think
> RT_SET is sufficient for vacuum. Here's how it could work:
>
> TidStoreSetBlockOffsets could have a stack variable that's "almost
> always" large enough. When not, it can allocate in its own context. It
> sets the necessary bits there. Then, it passes the pointer to RT_SET
> with the number of bytes to copy. That seems very simple.

Right.

> At some future time, we can add a new function with the complex
> business about getting the current value to modify it, with the
> re-alloc'ing that it might require.
>
> In other words, from both an API perspective and a performance
> perspective, it makes sense for tid store to have a simple "set"
> interface for vacuum that can be optimized for its characteristics
> (insert only, ordered offsets), and also a more complex one for bitmap
> scan (setting/unsetting bits of existing values, in any order). They
> can share the same iteration interface, key types, and value types.
>
> What do you think, Masahiko?

Good point. RT_SET() would be faster than RT_GET() plus updating the value, because RT_SET() would not need to take care of the existing value (its size, embedded or not, realloc, etc.).

I think we can separate the radix tree patch into two parts: the main implementation with RT_SET(), and more complex APIs such as RT_GET(). That way, it would probably make it easier to complete the radix tree and tidstore first.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > It's still unclear to me why the value doesn't need to contain the size. > > If I understand you correctly, in RT_GET(), the tree allocs a new > memory and updates the slot where the value is embedded with the > pointer to the allocated memory, and returns the pointer to the > caller. Since the returned value, newly allocated memory, is still > empty, the callner needs to copy the contents of the old value to the > new value and do whatever else it needs to. > > If the value is already a single-leave value and RT_GET() is called > with a larger size, the slot is always replaced with the newly > allocated area and the caller needs to copy the contents? If the tree > does realloc the value with a new size, how does the tree know the new > value is larger than the existing value? It seems like the caller > needs to provide a function to calculate the size of the value based > on the length. Right. My brief description mentioned one thing without details: The caller would need to control whether to re-alloc. RT_GET would pass the size. If nothing is found, the tree would allocate. If there is a value already, just return it. That means both the address of the slot, and the local pointer to the value (with embedded, would be the same address). The caller checks if the array is long enough. If not, call a new function that takes the new size, the address of the slot, and the pointer to the old value. The tree would re-alloc, put the alloc pointer in the slot and return the new local pointer. But as we agreed, that is all follow-up work.
On Fri, Dec 8, 2023 at 1:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > It's still unclear to me why the value doesn't need to contain the size. > > > > If I understand you correctly, in RT_GET(), the tree allocs a new > > memory and updates the slot where the value is embedded with the > > pointer to the allocated memory, and returns the pointer to the > > caller. Since the returned value, newly allocated memory, is still > > empty, the callner needs to copy the contents of the old value to the > > new value and do whatever else it needs to. > > > > If the value is already a single-leave value and RT_GET() is called > > with a larger size, the slot is always replaced with the newly > > allocated area and the caller needs to copy the contents? If the tree > > does realloc the value with a new size, how does the tree know the new > > value is larger than the existing value? It seems like the caller > > needs to provide a function to calculate the size of the value based > > on the length. > > Right. My brief description mentioned one thing without details: The > caller would need to control whether to re-alloc. RT_GET would pass > the size. If nothing is found, the tree would allocate. If there is a > value already, just return it. That means both the address of the > slot, and the local pointer to the value (with embedded, would be the > same address). The caller checks if the array is long enough. If not, > call a new function that takes the new size, the address of the slot, > and the pointer to the old value. The tree would re-alloc, put the > alloc pointer in the slot and return the new local pointer. But as we > agreed, that is all follow-up work. Thank you for the detailed explanation. That makes sense to me. We will address it as a follow-up work. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
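To spell that protocol out from the caller's side, here is a hypothetical pseudo-C sketch of the follow-up API; shared_rt_get(), shared_rt_grow(), and the nwords field are invented names for illustration, not anything in the patch:

    static void
    add_offsets_with_grow(shared_rt_radix_tree *tree, uint64 key, int needed_nwords)
    {
        RT_PTR_ALLOC *slot;
        bool        found;
        BlocktableEntry *value;
        size_t      needed_size = offsetof(BlocktableEntry, words) +
            needed_nwords * sizeof(bitmapword);

        /* pass the size; if the key is absent, the tree allocates that much */
        value = shared_rt_get(tree, key, &slot, needed_size, &found);

        if (found && value->nwords < needed_nwords)
        {
            BlocktableEntry *newval;

            /*
             * Existing value is too small.  The tree re-allocs, stores the new
             * (possibly DSA) pointer in *slot, and hands back a new local
             * pointer; copying the old contents over is the caller's job.
             */
            newval = shared_rt_grow(tree, slot, value, needed_size);
            memcpy(newval, value,
                   offsetof(BlocktableEntry, words) +
                   value->nwords * sizeof(bitmapword));
            value = newval;
        }

        /* ... the caller then sets whatever new bits it needs in 'value' ... */
    }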
On Fri, Dec 8, 2023 at 3:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 1:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > It's still unclear to me why the value doesn't need to contain the size. > > > > > > If I understand you correctly, in RT_GET(), the tree allocs a new > > > memory and updates the slot where the value is embedded with the > > > pointer to the allocated memory, and returns the pointer to the > > > caller. Since the returned value, newly allocated memory, is still > > > empty, the callner needs to copy the contents of the old value to the > > > new value and do whatever else it needs to. > > > > > > If the value is already a single-leave value and RT_GET() is called > > > with a larger size, the slot is always replaced with the newly > > > allocated area and the caller needs to copy the contents? If the tree > > > does realloc the value with a new size, how does the tree know the new > > > value is larger than the existing value? It seems like the caller > > > needs to provide a function to calculate the size of the value based > > > on the length. > > > > Right. My brief description mentioned one thing without details: The > > caller would need to control whether to re-alloc. RT_GET would pass > > the size. If nothing is found, the tree would allocate. If there is a > > value already, just return it. That means both the address of the > > slot, and the local pointer to the value (with embedded, would be the > > same address). BTW Given that the actual value size can be calculated only by the caller, how does the tree know if the value is embedded or not? It's probably related to how to store combined pointer/value slots. If leaf nodes have a bitmap array that indicates the corresponding slot is an embedded value or a pointer to a value, it would be easy. But since the bitmap array is needed only in the leaf nodes, internal nodes and leaf nodes will no longer be identical structure, which is not a bad thing to me, though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > BTW Given that the actual value size can be calculated only by the > caller, how does the tree know if the value is embedded or not? It's > probably related to how to store combined pointer/value slots. Right, this is future work. At first, variable-length types will have to be single-value leaves. In fact, the idea for storing up to 3 offsets in the bitmap header could be done this way -- it would just be a (small) single-value leaf. (Reminder: Currently, fixed-length values are compile-time embeddable if the platform pointer size is big enough.) > If leaf > nodes have a bitmap array that indicates the corresponding slot is an > embedded value or a pointer to a value, it would be easy. That's the most general way to do it. We could do it much more easily with a pointer tag, although for the above idea it may require some endian-aware coding. Both were mentioned in the paper, I recall. > But since > the bitmap array is needed only in the leaf nodes, internal nodes and > leaf nodes will no longer be identical structure, which is not a bad > thing to me, though. Absolutely no way we are going back to double everything: double types, double functions, double memory contexts. Plus, that bitmap in inner nodes could indicate a pointer to a leaf that got there by "lazy expansion".
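As a rough illustration of the pointer-tag alternative (the macro here is invented, and it assumes allocated leaves are at least 2-byte aligned so a real pointer never has its low bit set):

    /*
     * Use the lowest bit of the slot as the tag: 0 means the slot holds a
     * pointer to a single-value leaf, 1 means the value itself is embedded
     * in the slot -- which is why the caller has to give the tree one bit
     * of the value.
     */
    #define RT_SLOT_IS_EMBEDDED(slot)   (((uintptr_t) (slot)) & 1)

Storing up to three offsets directly in the remaining slot bits is where the endian-aware coding mentioned above would come in.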
On Fri, Dec 8, 2023 at 7:46 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > BTW Given that the actual value size can be calculated only by the > > caller, how does the tree know if the value is embedded or not? It's > > probably related to how to store combined pointer/value slots. > > Right, this is future work. At first, variable-length types will have > to be single-value leaves. In fact, the idea for storing up to 3 > offsets in the bitmap header could be done this way -- it would just > be a (small) single-value leaf. Agreed. > > (Reminder: Currently, fixed-length values are compile-time embeddable > if the platform pointer size is big enough.) > > > If leaf > > nodes have a bitmap array that indicates the corresponding slot is an > > embedded value or a pointer to a value, it would be easy. > > That's the most general way to do it. We could do it much more easily > with a pointer tag, although for the above idea it may require some > endian-aware coding. Both were mentioned in the paper, I recall. True. Probably we can use the combined pointer/value slots approach only if the tree is able to use the pointer tagging. That is, if the caller allows the tree to use one bit of the value. I'm going to update the patch based on the recent discussion (RT_SET() and variable-length values) etc., and post the patch set early next week. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Dec 8, 2023 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Dec 8, 2023 at 7:46 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > BTW Given that the actual value size can be calculated only by the > > > caller, how does the tree know if the value is embedded or not? It's > > > probably related to how to store combined pointer/value slots. > > > > Right, this is future work. At first, variable-length types will have > > to be single-value leaves. In fact, the idea for storing up to 3 > > offsets in the bitmap header could be done this way -- it would just > > be a (small) single-value leaf. > > Agreed. > > > > > (Reminder: Currently, fixed-length values are compile-time embeddable > > if the platform pointer size is big enough.) > > > > > If leaf > > > nodes have a bitmap array that indicates the corresponding slot is an > > > embedded value or a pointer to a value, it would be easy. > > > > That's the most general way to do it. We could do it much more easily > > with a pointer tag, although for the above idea it may require some > > endian-aware coding. Both were mentioned in the paper, I recall. > > True. Probably we can use the combined pointer/value slots approach > only if the tree is able to use the pointer tagging. That is, if the > caller allows the tree to use one bit of the value. > > I'm going to update the patch based on the recent discussion (RT_SET() > and variable-length values) etc., and post the patch set early next > week. I've attached the updated patch set. From the previous patch set, I've merged patches 0007 to 0010. The other changes such as adding RT_GET() still are unmerged for now, for discussion. Probably we can make them as follow-up patches as we discussed. 0011 to 0015 patches are new changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and support variable-length values. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've attached the updated patch set. From the previous patch set, I've > merged patches 0007 to 0010. The other changes such as adding RT_GET() > still are unmerged for now, for discussion. Probably we can make them > as follow-up patches as we discussed. 0011 to 0015 patches are new > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and > support variable-length values. This looks like the right direction, and I'm pleased it's not much additional code on top of my last patch. v44-0014: +#ifdef RT_VARLEN_VALUE + /* XXX: need to choose block sizes? */ + tree->leaf_ctx = AllocSetContextCreate(ctx, + "radix tree leaves", + ALLOCSET_DEFAULT_SIZES); +#else + tree->leaf_ctx = SlabContextCreate(ctx, + "radix tree leaves", + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), + sizeof(RT_VALUE_TYPE)); +#endif /* RT_VARLEN_VALUE */ Choosing block size: Similar to what we've discussed previously around DSA segments, we might model this on CreateWorkExprContext() in src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m / autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the max block size to 1/16 of that, or less. Also, it occurred to me that compile-time embeddable values don't need a leaf context. I'm not sure how many places assume that there is always a leaf context. If not many, it may be worth not creating one here, just to be tidy. + size_t copysize; - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE)); + copysize = sizeof(RT_VALUE_TYPE); +#endif + + memcpy(leaf.local, value_p, copysize); I'm not sure this indirection adds clarity. I guess the intent was to keep from saying "memcpy" twice, but now the code has to say "copysize = foo" twice. For varlen case, we need to watch out for slowness because of memcpy. Let's put that off for later testing, though. We may someday want to avoid a memcpy call for the varlen case, so let's keep it flexible here. v44-0015: +#define SizeOfBlocktableEntry (offsetof( Unused. + char buf[MaxBlocktableEntrySize] = {0}; Zeroing this buffer is probably going to be expensive. Also see this pre-existing comment: /* WIP: slow, since it writes to memory for every bit */ page->words[wordnum] |= ((bitmapword) 1 << bitnum); For this function (which will be vacuum-only, so we can assume ordering), in the loop we can: * declare the local bitmapword variable to be zero * set the bits on it * write it out to the right location when done. Let's fix both of these at once. + if (TidStoreIsShared(ts)) + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len); + else + local_rt_set(ts->tree.local, blkno, (void *) page, page_len); Is there a reason for "void *"? The declared parameter is "RT_VALUE_TYPE *value_p" in 0014. Also, since this function is for vacuum (and other uses will need a new function), let's assert the returned bool is false. Does iteration still work? If so, it's not too early to re-wire this up with vacuum and see how it behaves. Lastly, my compiler has a warning that CI doesn't have: In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121: ../src/include/lib/radixtree.h: In function ‘rt_find.isra’: ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used uninitialized [-Wmaybe-uninitialized] 2142 | return (RT_VALUE_TYPE*) slot; | ^~~~~~~~~~~~~~~~~~~~~ ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here 2112 | RT_PTR_ALLOC *slot; | ^~~~
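A sketch of that loop shape, as it might look inside TidStoreSetBlockOffsets(), relying on the vacuum-only guarantee that the offsets arrive in ascending order. WORDNUM()/BITNUM() are assumed helpers in the style of bitmapset.c; this replaces both the zeroed stack buffer and the per-bit stores quoted above:

    int         wordnum = 0;
    bitmapword  word = 0;

    for (int i = 0; i < num_offsets; i++)
    {
        int         off_wordnum = WORDNUM(offsets[i]);

        /* moved past the current word: flush it and zero any skipped words */
        while (wordnum < off_wordnum)
        {
            page->words[wordnum++] = word;
            word = 0;
        }
        word |= ((bitmapword) 1) << BITNUM(offsets[i]);
    }
    page->words[wordnum] = word;    /* flush the last word */

    page_len = offsetof(BlocktableEntry, words) +
        sizeof(bitmapword) * (wordnum + 1);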
On Tue, Dec 12, 2023 at 11:53 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the updated patch set. From the previous patch set, I've > > merged patches 0007 to 0010. The other changes such as adding RT_GET() > > still are unmerged for now, for discussion. Probably we can make them > > as follow-up patches as we discussed. 0011 to 0015 patches are new > > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and > > support variable-length values. > > This looks like the right direction, and I'm pleased it's not much > additional code on top of my last patch. > > v44-0014: > > +#ifdef RT_VARLEN_VALUE > + /* XXX: need to choose block sizes? */ > + tree->leaf_ctx = AllocSetContextCreate(ctx, > + "radix tree leaves", > + ALLOCSET_DEFAULT_SIZES); > +#else > + tree->leaf_ctx = SlabContextCreate(ctx, > + "radix tree leaves", > + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > + sizeof(RT_VALUE_TYPE)); > +#endif /* RT_VARLEN_VALUE */ > > Choosing block size: Similar to what we've discussed previously around > DSA segments, we might model this on CreateWorkExprContext() in > src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m / > autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the > max block size to 1/16 of that, or less. > > Also, it occurred to me that compile-time embeddable values don't need > a leaf context. I'm not sure how many places assume that there is > always a leaf context. If not many, it may be worth not creating one > here, just to be tidy. > > + size_t copysize; > > - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE)); > + copysize = sizeof(RT_VALUE_TYPE); > +#endif > + > + memcpy(leaf.local, value_p, copysize); > > I'm not sure this indirection adds clarity. I guess the intent was to > keep from saying "memcpy" twice, but now the code has to say "copysize > = foo" twice. > > For varlen case, we need to watch out for slowness because of memcpy. > Let's put that off for later testing, though. We may someday want to > avoid a memcpy call for the varlen case, so let's keep it flexible > here. > > v44-0015: > > +#define SizeOfBlocktableEntry (offsetof( > > Unused. > > + char buf[MaxBlocktableEntrySize] = {0}; > > Zeroing this buffer is probably going to be expensive. Also see this > pre-existing comment: > /* WIP: slow, since it writes to memory for every bit */ > page->words[wordnum] |= ((bitmapword) 1 << bitnum); > > For this function (which will be vacuum-only, so we can assume > ordering), in the loop we can: > * declare the local bitmapword variable to be zero > * set the bits on it > * write it out to the right location when done. > > Let's fix both of these at once. > > + if (TidStoreIsShared(ts)) > + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len); > + else > + local_rt_set(ts->tree.local, blkno, (void *) page, page_len); > > Is there a reason for "void *"? The declared parameter is > "RT_VALUE_TYPE *value_p" in 0014. > Also, since this function is for vacuum (and other uses will need a > new function), let's assert the returned bool is false. > > Does iteration still work? If so, it's not too early to re-wire this > up with vacuum and see how it behaves. 
> > Lastly, my compiler has a warning that CI doesn't have: > > In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121: > ../src/include/lib/radixtree.h: In function ‘rt_find.isra’: > ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used > uninitialized [-Wmaybe-uninitialized] > 2142 | return (RT_VALUE_TYPE*) slot; > | ^~~~~~~~~~~~~~~~~~~~~ > ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here > 2112 | RT_PTR_ALLOC *slot; > | ^~~~ Thank you for the comments! I agreed with all of them and incorporated them into the attached latest patch set, v45. In v45, 0001 - 0006 are from earlier versions but I've merged previous updates. So the radix tree now has RT_SET() and RT_FIND() but not RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous versions that incorporated the above comments. 0009 patch integrates tidstore with lazy vacuum. Note that DSA segment problem is not resolved yet in this patch. 0010 and 0011 makes DSA initial/max segment size configurable and make parallel vacuum specify both in proportion to maintenance_work_mem. 0012 is a development-purpose patch to make it easy to investigate bugs in tidstore. I'd like to keep it in the patch set at least during the development. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > In v45, 0001 - 0006 are from earlier versions but I've merged previous > updates. So the radix tree now has RT_SET() and RT_FIND() but not > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > versions that incorporated the above comments. 0009 patch integrates > tidstore with lazy vacuum. Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com ...and got similar results, so we still have good space-efficiency on this test: master: INFO: finished vacuuming "john.public.test": index scans: 9 system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s v45: INFO: finished vacuuming "john.public.test": index scans: 1 system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s More sparse TID distributions won't be as favorable, but we have ideas to improve that in the future. For my next steps, I will finish the node-shrinking behavior and save for a later patchset. Not needed for tid store, but needs to happen because of assumptions in the code. Also, some time ago, I think I commented out RT_FREE_RECURSE to get something working, so I'll fix it, and look at other fixmes and todos. > Note that DSA segment problem is not > resolved yet in this patch. I remember you started a separate thread about this, but I don't think it got any attention. Maybe reply with a "TLDR;" and share a patch to allow controlling max segment size. Some more comments: v45-0003: Since RT_ITERATE_NEXT_PTR works for tid store, do we even need RT_ITERATE_NEXT anymore? The former should handle fixed-length values just fine? If so, we should rename it to match the latter. + * The caller is responsible for locking/unlocking the tree in shared mode. This is not new to v45, but this will come up again below. This needs more explanation: Since we're returning a pointer (to support variable-length values), the caller needs to maintain control until it's finished with the value. v45-0005: + * Regarding the concurrency support, we use a single LWLock for the TidStore. + * The TidStore is exclusively locked when inserting encoded tids to the + * radix tree or when resetting itself. When searching on the TidStore or + * doing the iteration, it is not locked but the underlying radix tree is + * locked in shared mode. This is just stating facts without giving any reasons. Readers are going to wonder why it's inconsistent. The "why" is much more important than the "what". Even with that, this comment is also far from the relevant parts, and so will get out of date. Maybe we can just make sure each relevant function is explained individually. v45-0007: -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); Tid store calls this max_bytes -- can we use that name here, too? "work_mem" is highly specific. - RT_PTR_ALLOC *slot; + RT_PTR_ALLOC *slot = NULL; We have a macro for invalid pointer because of DSA. v45-0008: - if (off < 1 || off > MAX_TUPLES_PER_PAGE) + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) elog(ERROR, "tuple offset out of range: %u", off); This is a superfluous distraction, since the error path is located way off in the cold segment of the binary. v45-0009: (just a few small things for now) - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the - * vacrel->dead_items array. 
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items. I think we can keep as "listed in the TID store". - * Allocate dead_items (either using palloc, or in dynamic shared memory). - * Sets dead_items in vacrel for caller. + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items + * in vacrel for caller. I think we want to keep "in dynamic shared memory". It's still true. I'm not sure anything needs to change here, actually. parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, - int nrequested_workers, int max_items, - int elevel, BufferAccessStrategy bstrategy) + int nrequested_workers, int vac_work_mem, + int max_offset, int elevel, + BufferAccessStrategy bstrategy) It seems very strange to me that this function has to pass the max_offset. In general, it's been simpler to assume we have a constant max_offset, but in this case that fact is not helping. Something to think about for later. - (errmsg("scanned index \"%s\" to remove %d row versions", + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions", This should be signed int64. v45-0010: Thinking about this some more, I'm not sure we need to do anything different for the *starting* segment size. (Controlling *max* size does seem important, however.) For the corner case of m_w_m = 1MB, it's fine if vacuum quits pruning immediately after (in effect) it finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If the memory accounting starts >1MB because we're adding the trivial size of some struct, let's just stop doing that. The segment allocations are what we care about. v45-0011: + /* + * max_bytes is forced to be at least 64kB, the current minimum valid + * value for the work_mem GUC. + */ + max_bytes = Max(64 * 1024L, max_bytes); Why? I believe I mentioned months ago that copying a hard-coded value that can get out of sync is not maintainable, but I don't even see the point of this part.
On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > In v45, 0001 - 0006 are from earlier versions but I've merged previous > > updates. So the radix tree now has RT_SET() and RT_FIND() but not > > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > > versions that incorporated the above comments. 0009 patch integrates > > tidstore with lazy vacuum. > > Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from > > https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com > > ...and got similar results, so we still have good space-efficiency on this test: > > master: > INFO: finished vacuuming "john.public.test": index scans: 9 > system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s > > v45: > INFO: finished vacuuming "john.public.test": index scans: 1 > system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s Thank you for testing it again. That's a very good result. > For my next steps, I will finish the node-shrinking behavior and save > for a later patchset. Not needed for tid store, but needs to happen > because of assumptions in the code. Also, some time ago, I think I > commented out RT_FREE_RECURSE to get something working, so I'll fix > it, and look at other fixmes and todos. Great! > > > Note that DSA segment problem is not > > resolved yet in this patch. > > I remember you started a separate thread about this, but I don't think > it got any attention. Maybe reply with a "TLDR;" and share a patch to > allow controlling max segment size. Yeah, I recalled that thread. Will send a reply. > > Some more comments: > > v45-0003: > > Since RT_ITERATE_NEXT_PTR works for tid store, do we even need > RT_ITERATE_NEXT anymore? The former should handle fixed-length values > just fine? If so, we should rename it to match the latter. Agreed to rename it. > > + * The caller is responsible for locking/unlocking the tree in shared mode. > > This is not new to v45, but this will come up again below. This needs > more explanation: Since we're returning a pointer (to support > variable-length values), the caller needs to maintain control until > it's finished with the value. Will fix. > > v45-0005: > > + * Regarding the concurrency support, we use a single LWLock for the TidStore. > + * The TidStore is exclusively locked when inserting encoded tids to the > + * radix tree or when resetting itself. When searching on the TidStore or > + * doing the iteration, it is not locked but the underlying radix tree is > + * locked in shared mode. > > This is just stating facts without giving any reasons. Readers are > going to wonder why it's inconsistent. The "why" is much more > important than the "what". Even with that, this comment is also far > from the relevant parts, and so will get out of date. Maybe we can > just make sure each relevant function is explained individually. Right, I'll fix it. > > v45-0007: > > -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); > +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); > > Tid store calls this max_bytes -- can we use that name here, too? > "work_mem" is highly specific. While I agree that "work_mem" is highly specific, I avoided using "max_bytes" in radix tree because "max_bytes" sounds to me there is a memory limitation but the radix tree doesn't have it actually. 
It might be sufficient to mention it in the comment, though. > > - RT_PTR_ALLOC *slot; > + RT_PTR_ALLOC *slot = NULL; > > We have a macro for invalid pointer because of DSA. Will fix. > > v45-0008: > > - if (off < 1 || off > MAX_TUPLES_PER_PAGE) > + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) > elog(ERROR, "tuple offset out of range: %u", off); > > This is a superfluous distraction, since the error path is located way > off in the cold segment of the binary. Okay, will remove it. > > v45-0009: > > (just a few small things for now) > > - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the > - * vacrel->dead_items array. > + * lazy_vacuum_heap_page() -- free page's LP_DEAD items. > > I think we can keep as "listed in the TID store". > > - * Allocate dead_items (either using palloc, or in dynamic shared memory). > - * Sets dead_items in vacrel for caller. > + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items > + * in vacrel for caller. > > I think we want to keep "in dynamic shared memory". It's still true. > I'm not sure anything needs to change here, actually. Agreed with above comments. Will fix them. > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > - int nrequested_workers, int max_items, > - int elevel, BufferAccessStrategy bstrategy) > + int nrequested_workers, int vac_work_mem, > + int max_offset, int elevel, > + BufferAccessStrategy bstrategy) > > It seems very strange to me that this function has to pass the > max_offset. In general, it's been simpler to assume we have a constant > max_offset, but in this case that fact is not helping. Something to > think about for later. max_offset was previously used in old TID encoding in tidstore. Since tidstore has entries for each block, I think we no longer need it. > > - (errmsg("scanned index \"%s\" to remove %d row versions", > + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions", > > This should be signed int64. Will fix. > > v45-0010: > > Thinking about this some more, I'm not sure we need to do anything > different for the *starting* segment size. (Controlling *max* size > does seem important, however.) For the corner case of m_w_m = 1MB, > it's fine if vacuum quits pruning immediately after (in effect) it > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > the memory accounting starts >1MB because we're adding the trivial > size of some struct, let's just stop doing that. The segment > allocations are what we care about. IIUC it's for work_mem, whose the minimum value is 64kB. > > v45-0011: > > + /* > + * max_bytes is forced to be at least 64kB, the current minimum valid > + * value for the work_mem GUC. > + */ > + max_bytes = Max(64 * 1024L, max_bytes); > > Why? This is to avoid creating a radix tree within very small memory. The minimum work_mem value is a reasonable lower bound that PostgreSQL uses internally. It's actually copied from tuplesort.c. >I believe I mentioned months ago that copying a hard-coded value > that can get out of sync is not maintainable, but I don't even see the > point of this part. True. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote: > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > > - int nrequested_workers, int max_items, > > - int elevel, BufferAccessStrategy bstrategy) > > + int nrequested_workers, int vac_work_mem, > > + int max_offset, int elevel, > > + BufferAccessStrategy bstrategy) > > > > It seems very strange to me that this function has to pass the > > max_offset. In general, it's been simpler to assume we have a constant > > max_offset, but in this case that fact is not helping. Something to > > think about for later. > > max_offset was previously used in old TID encoding in tidstore. Since > tidstore has entries for each block, I think we no longer need it. It's needed now to properly size the allocation of TidStoreIter which contains... +/* Result struct for TidStoreIterateNext */ +typedef struct TidStoreIterResult +{ + BlockNumber blkno; + int num_offsets; + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; +} TidStoreIterResult; Maybe we can palloc the offset array to "almost always" big enough, with logic to resize if needed? If not too hard, seems worth it to avoid churn in the parameter list. > > v45-0010: > > > > Thinking about this some more, I'm not sure we need to do anything > > different for the *starting* segment size. (Controlling *max* size > > does seem important, however.) For the corner case of m_w_m = 1MB, > > it's fine if vacuum quits pruning immediately after (in effect) it > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > > the memory accounting starts >1MB because we're adding the trivial > > size of some struct, let's just stop doing that. The segment > > allocations are what we care about. > > IIUC it's for work_mem, whose the minimum value is 64kB. > > > > > v45-0011: > > > > + /* > > + * max_bytes is forced to be at least 64kB, the current minimum valid > > + * value for the work_mem GUC. > > + */ > > + max_bytes = Max(64 * 1024L, max_bytes); > > > > Why? > > This is to avoid creating a radix tree within very small memory. The > minimum work_mem value is a reasonable lower bound that PostgreSQL > uses internally. It's actually copied from tuplesort.c. There is no explanation for why it should be done like tuplesort.c. Also... - tree->leaf_ctx = SlabContextCreate(ctx, - "radix tree leaves", - RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), - sizeof(RT_VALUE_TYPE)); + tree->leaf_ctx = SlabContextCreate(ctx, + "radix tree leaves", + Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), + work_mem), + sizeof(RT_VALUE_TYPE)); At first, my eyes skipped over this apparent re-indent, but hidden inside here is another (undocumented) attempt to clamp the size of something. There are too many of these sprinkled in various places, and they're already a maintenance hazard -- a different one was left behind in v45-0011: @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area) ts->control->max_bytes = max_bytes - (70 * 1024); } Let's do it in just one place. In TidStoreCreate(), do /* clamp max_bytes to at least the size of the empty tree with allocated blocks, so it doesn't immediately appear full */ ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. 
I may not recall everything while writing this, but it seems the only other thing we should be clamping is the max aset block size (solved) / max DSM segment size (in progress).
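Spelled out for both cases, the single clamp in TidStoreCreate() might look like the following (a sketch only, using the templated shared_rt_/local_rt_ names that appear elsewhere in the patch):

    /*
     * Clamp max_bytes to at least what the empty tree already uses for its
     * initial allocations, so a tiny limit doesn't make the store appear
     * full before anything has been inserted.
     */
    if (TidStoreIsShared(ts))
        ts->control->max_bytes = Max(max_bytes,
                                     shared_rt_memory_usage(ts->tree.shared));
    else
        ts->control->max_bytes = Max(max_bytes,
                                     local_rt_memory_usage(ts->tree.local));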
On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > > > - int nrequested_workers, int max_items, > > > - int elevel, BufferAccessStrategy bstrategy) > > > + int nrequested_workers, int vac_work_mem, > > > + int max_offset, int elevel, > > > + BufferAccessStrategy bstrategy) > > > > > > It seems very strange to me that this function has to pass the > > > max_offset. In general, it's been simpler to assume we have a constant > > > max_offset, but in this case that fact is not helping. Something to > > > think about for later. > > > > max_offset was previously used in old TID encoding in tidstore. Since > > tidstore has entries for each block, I think we no longer need it. > > It's needed now to properly size the allocation of TidStoreIter which > contains... > > +/* Result struct for TidStoreIterateNext */ > +typedef struct TidStoreIterResult > +{ > + BlockNumber blkno; > + int num_offsets; > + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; > +} TidStoreIterResult; > > Maybe we can palloc the offset array to "almost always" big enough, > with logic to resize if needed? If not too hard, seems worth it to > avoid churn in the parameter list. Yes, I was thinking of that. > > > > v45-0010: > > > > > > Thinking about this some more, I'm not sure we need to do anything > > > different for the *starting* segment size. (Controlling *max* size > > > does seem important, however.) For the corner case of m_w_m = 1MB, > > > it's fine if vacuum quits pruning immediately after (in effect) it > > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > > > the memory accounting starts >1MB because we're adding the trivial > > > size of some struct, let's just stop doing that. The segment > > > allocations are what we care about. > > > > IIUC it's for work_mem, whose the minimum value is 64kB. > > > > > > > > v45-0011: > > > > > > + /* > > > + * max_bytes is forced to be at least 64kB, the current minimum valid > > > + * value for the work_mem GUC. > > > + */ > > > + max_bytes = Max(64 * 1024L, max_bytes); > > > > > > Why? > > > > This is to avoid creating a radix tree within very small memory. The > > minimum work_mem value is a reasonable lower bound that PostgreSQL > > uses internally. It's actually copied from tuplesort.c. > > There is no explanation for why it should be done like tuplesort.c. Also... > > - tree->leaf_ctx = SlabContextCreate(ctx, > - "radix tree leaves", > - RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > - sizeof(RT_VALUE_TYPE)); > + tree->leaf_ctx = SlabContextCreate(ctx, > + "radix tree leaves", > + Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > + work_mem), > + sizeof(RT_VALUE_TYPE)); > > At first, my eyes skipped over this apparent re-indent, but hidden > inside here is another (undocumented) attempt to clamp the size of > something. There are too many of these sprinkled in various places, > and they're already a maintenance hazard -- a different one was left > behind in v45-0011: > > @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off, > dsa_area *area) > ts->control->max_bytes = max_bytes - (70 * 1024); > } > > Let's do it in just one place. 
In TidStoreCreate(), do > > /* clamp max_bytes to at least the size of the empty tree with > allocated blocks, so it doesn't immediately appear full */ > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. But doesn't it mean that even if we create a shared tidstore with small memory, say 64kB, it actually uses 1MB? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Let's do it in just one place. In TidStoreCreate(), do > > > > /* clamp max_bytes to at least the size of the empty tree with > > allocated blocks, so it doesn't immediately appear full */ > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. > > But doesn't it mean that even if we create a shared tidstore with > small memory, say 64kB, it actually uses 1MB? This sounds like an argument for controlling the minimum DSA segment size. (I'm not really in favor of that, but open to others' opinion) I wasn't talking about that above -- I was saying we should have only one place where we clamp max_bytes so that the tree doesn't immediately appear full.
On Tue, Dec 19, 2023 at 4:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Dec 18, 2023 at 3:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > Let's do it in just one place. In TidStoreCreate(), do > > > > > > /* clamp max_bytes to at least the size of the empty tree with > > > allocated blocks, so it doesn't immediately appear full */ > > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > > > > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. > > > > But doesn't it mean that even if we create a shared tidstore with > > small memory, say 64kB, it actually uses 1MB? > > This sounds like an argument for controlling the minimum DSA segment > size. (I'm not really in favor of that, but open to others' opinion) > > I wasn't talking about that above -- I was saying we should have only > one place where we clamp max_bytes so that the tree doesn't > immediately appear full. Thank you for your clarification. Understood. I've updated the new patch set that incorporated comments I got so far. 0007, 0008, and 0012 patches are updates from the v45 patch set. In addition to the review comments, I made some changes in tidstore to make it independent from heap. Specifically, it uses MaxOffsetNumber instead of MaxHeapTuplesPerPage. Now we don't need to include htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 bytes. BTW regarding the previous comment I got before: > - RT_PTR_ALLOC *slot; > + RT_PTR_ALLOC *slot = NULL; > > We have a macro for invalid pointer because of DSA. I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL. As for the initial and maximum DSA segment sizes, I've sent a summary on that thread: https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
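For reference, the 272 bytes follows from the new constant, assuming the default 8kB block size and a 64-bit build: MaxOffsetNumber is BLCKSZ / sizeof(ItemIdData) = 8192 / 4 = 2048, so the offset bitmap needs 2048 / 64 = 32 bitmapwords, i.e. 256 bytes, plus the small fixed part of BlocktableEntry. With MaxHeapTuplesPerPage (291) the bitmap needed only 5 bitmapwords (40 bytes), which is why the maximum entry size grew.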
On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've updated the new patch set that incorporated comments I got so > far. 0007, 0008, and 0012 patches are updates from the v45 patch set. > In addition to the review comments, I made some changes in tidstore to > make it independent from heap. Specifically, it uses MaxOffsetNumber > instead of MaxHeapTuplesPerPage. Now we don't need to include > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 > bytes. That's a good idea. > BTW regarding the previous comment I got before: > > > - RT_PTR_ALLOC *slot; > > + RT_PTR_ALLOC *slot = NULL; > > > > We have a macro for invalid pointer because of DSA. > > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL. Ah right, it's the address of the slot. > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. That could probably use some discussion. A few months ago, I found the debugging functions only worked when everything else worked. When things weren't working, I had to rip one of these functions apart so it only looked at one node. If something is broken, we can't count on recursion or iteration working, because we won't get that far. I don't remember how things are in the current patch. I've finished the node shrinking and addressed some fixme/todo areas -- can I share these and squash your v46 changes first?
On Thu, Dec 21, 2023 at 10:19 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've updated the new patch set that incorporated comments I got so > > far. 0007, 0008, and 0012 patches are updates from the v45 patch set. > > In addition to the review comments, I made some changes in tidstore to > > make it independent from heap. Specifically, it uses MaxOffsetNumber > > instead of MaxHeapTuplesPerPage. Now we don't need to include > > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 > > bytes. > > That's a good idea. > > > BTW regarding the previous comment I got before: > > > > > - RT_PTR_ALLOC *slot; > > > + RT_PTR_ALLOC *slot = NULL; > > > > > > We have a macro for invalid pointer because of DSA. > > > > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL. > > Ah right, it's the address of the slot. > > > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. > > That could probably use some discussion. A few months ago, I found the > debugging functions only worked when everything else worked. When > things weren't working, I had to rip one of these functions apart so > it only looked at one node. If something is broken, we can't count on > recursion or iteration working, because we won't get that far. I don't > remember how things are in the current patch. Agreed. I found the following comment and wanted to discuss: // this might be better as "iterate over nodes", plus a callback to RT_DUMP_NODE, // which should really only concern itself with single nodes RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree) If it means we need to somehow use the iteration functions also for dumping the whole tree, it would probably need to refactor the iteration codes so that the RT_DUMP() can use them while dumping visited nodes. But we need to be careful of not adding overheads to the iteration performance. > > I've finished the node shrinking and addressed some fixme/todo areas > -- can I share these and squash your v46 changes first? Cool! Yes, please do so. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I found the following comment and wanted to discuss: > > // this might be better as "iterate over nodes", plus a callback to > RT_DUMP_NODE, > // which should really only concern itself with single nodes > RT_SCOPE void > RT_DUMP(RT_RADIX_TREE *tree) > > If it means we need to somehow use the iteration functions also for > dumping the whole tree, it would probably need to refactor the > iteration codes so that the RT_DUMP() can use them while dumping > visited nodes. But we need to be careful of not adding overheads to > the iteration performance. Yeah, some months ago I thought a callback interface would make some things easier. I don't think we need that at the moment (possibly never), so that comment can be just removed. As far as these debug functions, I only found useful the stats and dumping a single node, FWIW. I've attached v47, which is v46 plus some fixes for radix tree. 0004 - moves everything for "delete" to the end -- gradually other things will be grouped together in a sensible order 0005 - trivial 0006 - shrink nodes -- still needs testing, but nothing crashes yet. This shows some renaming might be good: Previously we had RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply not copying the slot to be deleted. This means when growing it would be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that reserves a new slot for the caller in the new node, but the caller must do the insert itself. Note that there are some practical restrictions/best-practices on whether shrinking should happen after deletion or vice versa. Hopefully it's clear, but let me know if the description can be improved. Also, it doesn't yet shrink from size class 32 to 16, but it could with a bit of work. 0007 - trivial, but could use a better comment. I also need to make sure stats reporting works (may also need some cleanup work). 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago if DSA could just free all our allocated segments without throwing away the DSA, and that's still a good question. 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think this name is better than RT_UPDATE_ITER_STACK, so maybe we should go back to that). The assert doesn't fire, so I guess it does what it's supposed to? For me, the iteration logic is still the most confusing piece out of the whole radix tree. Maybe that could be helped with some better variable names, but I wonder if it needs more invasive work. I confess I don't have better ideas for how it would work differently. 0010 - some fixes for number of children accounting in node256 0011 - Long overdue pgindent of radixtree.h, without trying to fix up afterwards. Feel free to throw out and redo if this interferes with ongoing work. The rest are from your v46. The bench doesn't work for tid store anymore, so I squashed "disable bench for CI" until we get back to that. Some more review comments (note: patch numbers are for v47, but I changed nothing from v46 in this area): 0013: + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, + * and stored in the radix tree. Recently outdated. The variable length values seems to work, so let's make everything match. +#define MAX_TUPLES_PER_PAGE MaxOffsetNumber Maybe we don't need this macro anymore? The name no longer fits, in any case. 
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, + int num_offsets) +{ + char buf[MaxBlocktableEntrySize]; + BlocktableEntry *page = (BlocktableEntry *) buf; I'm not sure this is safe with alignment. Maybe rather than plain "char", it needs to be a union with BlocktableEntry, or something. +static inline BlocktableEntry * +tidstore_iter_kv(TidStoreIter *iter, uint64 *key) +{ + if (TidStoreIsShared(iter->ts)) + return shared_rt_iterate_next(iter->tree_iter.shared, key); + + return local_rt_iterate_next(iter->tree_iter.local, key); +} In the old encoding scheme, this function did something important, but now it's a useless wrapper with one caller. + /* + * In the shared case, TidStoreControl and radix_tree are backed by the + * same DSA area and rt_memory_usage() returns the value including both. + * So we don't need to add the size of TidStoreControl separately. + */ + if (TidStoreIsShared(ts)) + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared); + + return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local); I don't see the point in including these tiny structs, since we will always blow past the limit by a number of kilobytes (at least, often megabytes or more) at the time it happens. + iter->output.max_offset = 64; Maybe needs a comment that this is just some starting size and not anything particular. + iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset); + /* Make sure there is enough space to add offsets */ + if (result->num_offsets + bmw_popcount(w) > result->max_offset) + { + result->max_offset *= 2; + result->offsets = repalloc(result->offsets, + sizeof(OffsetNumber) * result->max_offset); + } popcount()-ing for every array element in every value is expensive -- let's just add sizeof(bitmapword). It's not that wasteful, but then the initial max will need to be 128. About separation of responsibilities for locking: The only thing currently where the tid store is not locked is tree iteration. That's a strange exception. Also, we've recently made RT_FIND return a pointer, so the caller must somehow hold a share lock, but I think we haven't exposed callers the ability to do that, and we rely on the tid store lock for that. We have a mix of tree locking and tid store locking. We will need to consider carefully how to make this more clear, maintainable, and understandable. 0015: "XXX: some regression test fails since this commit changes the minimum m_w_m to 2048 from 1024. This was necessary for the pervious memory" This shouldn't fail anymore if the "one-place" clamp was in a patch before this. If so, lets take out that GUC change and worry about min/max size separately. If it still fails, I'd like to know why. - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the - * vacrel->dead_items array. + * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the TID store. What I was getting at earlier is that the first line here doesn't really need to change, we can just s/array/store/ ? -static int -lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, - int index, Buffer vmbuffer) +static void +lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, + OffsetNumber *deadoffsets, int num_offsets, Buffer buffer, + Buffer vmbuffer) "buffer" should still come after "blkno", so that line doesn't need to change. 
$ git diff master -- src/backend/access/heap/ | grep has_lpdead_items - bool has_lpdead_items; /* includes existing LP_DEAD items */ - * pruning and freezing. all_visible implies !has_lpdead_items, but don't - Assert(!prunestate.all_visible || !prunestate.has_lpdead_items); - if (prunestate.has_lpdead_items) - else if (prunestate.has_lpdead_items && PageIsAllVisible(page)) - if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) - prunestate->has_lpdead_items = false; - prunestate->has_lpdead_items = true; In a green field, it'd be fine to replace these with an expression of "num_offsets", but it adds a bit of noise for reviewers and the git log. Is it really necessary? - deadoffsets[lpdead_items++] = offnum; + prunestate->deadoffsets[prunestate->num_offsets++] = offnum; I'm also not quite sure why "deadoffsets" and "lpdead_items" got moved to the PruneState. The latter was renamed in a way that makes more sense, but I don't see why the churn is necessary. @@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel, } #endif - /* - * Now save details of the LP_DEAD items from the page in vacrel - */ - if (lpdead_items > 0) + if (prunestate->num_offsets > 0) { - VacDeadItems *dead_items = vacrel->dead_items; - ItemPointerData tmp; - vacrel->lpdead_item_pages++; - prunestate->has_lpdead_items = true; - - ItemPointerSetBlockNumber(&tmp, blkno); - - for (int i = 0; i < lpdead_items; i++) - { - ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]); - dead_items->items[dead_items->num_items++] = tmp; - } - - Assert(dead_items->num_items <= dead_items->max_items); - pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES, - dead_items->num_items); I don't understand why this block got removed and nothing new is adding anything to the tid store. @@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel) * with prunestate-driven visibility map and FSM steps (just like * the two-pass strategy). */ - Assert(dead_items->num_items == 0); + Assert(TidStoreNumTids(dead_items) == 0); + } + else if (prunestate.num_offsets > 0) + { + /* Save details of the LP_DEAD items from the page in dead_items */ + TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets, + prunestate.num_offsets); + + pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES, + TidStoreMemoryUsage(dead_items)); I guess it was added here, 800 lines away? If so, why? About progress reporting: I want to make sure no one is going to miss counting "num_dead_tuples". It's no longer relevant for the number of index scans we need to do, but do admins still have a use for it? Something to think about later. 0017 + /* + * max_bytes is forced to be at least 64kB, the current minimum valid + * value for the work_mem GUC. + */ + max_bytes = Max(64 * 1024L, max_bytes); If this still needs to be here, I still don't understand why.
Attachment
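Regarding the alignment concern above, the union idea could look something like this (just a sketch; PGAlignedBlock in c.h uses the same trick for page-sized buffers):

    union
    {
        char        data[MaxBlocktableEntrySize];
        BlocktableEntry force_align;    /* never used directly; only forces alignment */
    }           buf;
    BlocktableEntry *page = (BlocktableEntry *) buf.data;

A bare char array on the stack is only guaranteed byte alignment, so casting it to BlocktableEntry * can yield a misaligned pointer on strict-alignment platforms (and trips -fsanitize=alignment); putting the buffer in a union with BlocktableEntry makes the compiler align it properly.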
Hi, On 2023-12-21 14:41:37 +0700, John Naylor wrote: > I've attached v47, which is v46 plus some fixes for radix tree. Could either of you summarize what the design changes you've made in the last months are and why you've done them? Unfortunately this thread is very long, and the comments in the file just say "FIXME" in places that apparently are affected by design changes. This makes it hard to catch up here. Greetings, Andres Freund
On Thu, Dec 21, 2023 at 6:27 PM Andres Freund <andres@anarazel.de> wrote: > > Could either of you summarize what the design changes you've made in the last > months are and why you've done them? Unfortunately this thread is very long, > and the comments in the file just say "FIXME" in places that apparently are > affected by design changes. This makes it hard to catch up here. I'd be happy to try, since we are about due for a summary. I was also hoping to reach a coherent-enough state sometime in early January to request your feedback, so good timing. Not sure how much detail to go into, but here goes: Back in May [1], the method of value storage shifted towards "combined pointer-value slots", which was described and recommended in the paper. There were some other changes for simplicity and efficiency, but none as far-reaching as this. This is enabled by using the template architecture that we adopted long ago for different reasons. Fixed-length values are either stored in the slot of the last-level node (if the value fits into the platform's pointer), or are a "single-value" leaf (otherwise). For tid store, we want to eventually support bitmap heap scans (in addition to vacuum), and in doing so make it independent of heap AM. That means value types similar to PageTableEntry in tidbitmap.c, but with a variable number of bitmapwords. That required the radix tree to support variable-length values. That has been the main focus in the last several months, and it basically works now. To my mind, the biggest architectural issues in the patch today are: - Variable-length values mean that pointers are passed around in places. This will require shifting some responsibility for locking to the caller, or longer-term maybe a callback interface. (This is new, the below are pre-existing issues.) - The tid store has its own "control object" (when shared memory is needed) with its own lock, in addition to the same for the associated radix tree. This leads to unnecessary double-locking. This area needs some attention. - Memory accounting is still unsettled. The current thinking is to cap max block/segment size, scaled to a fraction of m_w_m, but there are still open questions. There has been some recent effort toward finishing work started earlier, like shrinking nodes. There are a couple of places that could still use either simplification or optimization, but they otherwise work fine. Most of the remaining fixmes/todos/wips are trivial; a few are actually outdated now that I look again, and will be removed shortly. The regression tests could use some tidying up. -John [1] https://www.postgresql.org/message-id/CAFBsxsFyWLxweHVDtKb7otOCR4XdQGYR4b%2B9svxpVFnJs08BmQ%40mail.gmail.com
On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I found the following comment and wanted to discuss: > > > > // this might be better as "iterate over nodes", plus a callback to > > RT_DUMP_NODE, > > // which should really only concern itself with single nodes > > RT_SCOPE void > > RT_DUMP(RT_RADIX_TREE *tree) > > > > If it means we need to somehow use the iteration functions also for > > dumping the whole tree, it would probably need to refactor the > > iteration codes so that the RT_DUMP() can use them while dumping > > visited nodes. But we need to be careful of not adding overheads to > > the iteration performance. > > Yeah, some months ago I thought a callback interface would make some > things easier. I don't think we need that at the moment (possibly > never), so that comment can be just removed. As far as these debug > functions, I only found useful the stats and dumping a single node, > FWIW. > > I've attached v47, which is v46 plus some fixes for radix tree. > > 0004 - moves everything for "delete" to the end -- gradually other > things will be grouped together in a sensible order > > 0005 - trivial LGTM. > > 0006 - shrink nodes -- still needs testing, but nothing crashes yet. Cool. The coverage test results showed the shrink codes are also covered. > This shows some renaming might be good: Previously we had > RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've > added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply > not copying the slot to be deleted. This means when growing it would > be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that > reserves a new slot for the caller in the new node, but the caller > must do the insert itself. Agreed. > Note that there are some practical > restrictions/best-practices on whether shrinking should happen after > deletion or vice versa. Hopefully it's clear, but let me know if the > description can be improved. Also, it doesn't yet shrink from size > class 32 to 16, but it could with a bit of work. Sounds reasonable. > > 0007 - trivial, but could use a better comment. I also need to make > sure stats reporting works (may also need some cleanup work). > > 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago > if DSA could just free all our allocated segments without throwing > away the DSA, and that's still a good question. LGTM. > > 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think > this name is better than RT_UPDATE_ITER_STACK, so maybe we should go > back to that). Will rename it. > The assert doesn't fire, so I guess it does what it's > supposed to? Yes. > For me, the iteration logic is still the most confusing > piece out of the whole radix tree. Maybe that could be helped with > some better variable names, but I wonder if it needs more invasive > work. True. Maybe more comments would also help. > > 0010 - some fixes for number of children accounting in node256 > > 0011 - Long overdue pgindent of radixtree.h, without trying to fix up > afterwards. Feel free to throw out and redo if this interferes with > ongoing work. > LGTM. I'm working on the below review comments and most of them are already incorporated on the local branch: > The rest are from your v46. The bench doesn't work for tid store > anymore, so I squashed "disable bench for CI" until we get back to > that. 
Some more review comments (note: patch numbers are for v47, but > I changed nothing from v46 in this area): > > 0013: > > + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, > + * and stored in the radix tree. > > Recently outdated. The variable length values seems to work, so let's > make everything match. > > +#define MAX_TUPLES_PER_PAGE MaxOffsetNumber > > Maybe we don't need this macro anymore? The name no longer fits, in any case. Removed. > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, > + int num_offsets) > +{ > + char buf[MaxBlocktableEntrySize]; > + BlocktableEntry *page = (BlocktableEntry *) buf; > > I'm not sure this is safe with alignment. Maybe rather than plain > "char", it needs to be a union with BlocktableEntry, or something. I tried it in the new patch set but could you explain why it could not be safe with alignment? > > +static inline BlocktableEntry * > +tidstore_iter_kv(TidStoreIter *iter, uint64 *key) > +{ > + if (TidStoreIsShared(iter->ts)) > + return shared_rt_iterate_next(iter->tree_iter.shared, key); > + > + return local_rt_iterate_next(iter->tree_iter.local, key); > +} > > In the old encoding scheme, this function did something important, but > now it's a useless wrapper with one caller. Removed. > > + /* > + * In the shared case, TidStoreControl and radix_tree are backed by the > + * same DSA area and rt_memory_usage() returns the value including both. > + * So we don't need to add the size of TidStoreControl separately. > + */ > + if (TidStoreIsShared(ts)) > + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared); > + > + return sizeof(TidStore) + sizeof(TidStore) + > local_rt_memory_usage(ts->tree.local); > > I don't see the point in including these tiny structs, since we will > always blow past the limit by a number of kilobytes (at least, often > megabytes or more) at the time it happens. Agreed, removed. > > + iter->output.max_offset = 64; > > Maybe needs a comment that this is just some starting size and not > anything particular. > > + iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset); > > + /* Make sure there is enough space to add offsets */ > + if (result->num_offsets + bmw_popcount(w) > result->max_offset) > + { > + result->max_offset *= 2; > + result->offsets = repalloc(result->offsets, > + sizeof(OffsetNumber) * result->max_offset); > + } > > popcount()-ing for every array element in every value is expensive -- > let's just add sizeof(bitmapword). It's not that wasteful, but then > the initial max will need to be 128. Good idea. > > About separation of responsibilities for locking: The only thing > currently where the tid store is not locked is tree iteration. That's > a strange exception. Also, we've recently made RT_FIND return a > pointer, so the caller must somehow hold a share lock, but I think we > haven't exposed callers the ability to do that, and we rely on the tid > store lock for that. We have a mix of tree locking and tid store > locking. We will need to consider carefully how to make this more > clear, maintainable, and understandable. Yes, tidstore should be locked during the iteration. One simple direction about locking is that the radix tree has the lock but no APIs hold/release it. It's the caller's responsibility. If a data structure using a radix tree for its storage has its own lock (like tidstore), it can use it instead of the radix tree's one. 
A downside would be that it's probably hard to support a better locking algorithm such as ROWEX in the radix tree. Another variant of APIs that also does locking/unlocking within APIs might help. > > 0015: > > "XXX: some regression test fails since this commit changes the minimum > m_w_m to 2048 from 1024. This was necessary for the pervious memory" > > This shouldn't fail anymore if the "one-place" clamp was in a patch > before this. If so, lets take out that GUC change and worry about > min/max size separately. If it still fails, I'd like to know why. Agreed. > > - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the > - * vacrel->dead_items array. > + * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in > the TID store. > > What I was getting at earlier is that the first line here doesn't > really need to change, we can just s/array/store/ ? Fixed. > > -static int > -lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, > - int index, Buffer vmbuffer) > +static void > +lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, > + OffsetNumber *deadoffsets, > int num_offsets, Buffer buffer, > + Buffer vmbuffer) > > "buffer" should still come after "blkno", so that line doesn't need to change. Fixed. > > $ git diff master -- src/backend/access/heap/ | grep has_lpdead_items > - bool has_lpdead_items; /* includes existing LP_DEAD items */ > - * pruning and freezing. all_visible implies !has_lpdead_items, but don't > - Assert(!prunestate.all_visible || !prunestate.has_lpdead_items); > - if (prunestate.has_lpdead_items) > - else if (prunestate.has_lpdead_items && PageIsAllVisible(page)) > - if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) > - prunestate->has_lpdead_items = false; > - prunestate->has_lpdead_itemshas_lpdead_itemshas_lpdead_itemshas_lpdead_items > = true; > > In a green field, it'd be fine to replace these with an expression of > "num_offsets", but it adds a bit of noise for reviewers and the git > log. Is it really necessary? I see your point. I think we can live with having both has_lpdead_items and num_offsets. But we will have to check if these values are consistent, which could be less maintainable. > > - deadoffsets[lpdead_items++] = offnum; > + > prunestate->deadoffsets[prunestate->num_offsets++] = offnum; > > I'm also not quite sure why "deadoffsets" and "lpdead_items" got > moved to the PruneState. The latter was renamed in a way that makes > more sense, but I don't see why the churn is necessary. > > @@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel, > } > #endif > > - /* > - * Now save details of the LP_DEAD items from the page in vacrel > - */ > - if (lpdead_items > 0) > + if (prunestate->num_offsets > 0) > { > - VacDeadItems *dead_items = vacrel->dead_items; > - ItemPointerData tmp; > - > vacrel->lpdead_item_pages++; > - prunestate->has_lpdead_items = true; > - > - ItemPointerSetBlockNumber(&tmp, blkno); > - > - for (int i = 0; i < lpdead_items; i++) > - { > - ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]); > - dead_items->items[dead_items->num_items++] = tmp; > - } > - > - Assert(dead_items->num_items <= dead_items->max_items); > - pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES, > - > dead_items->num_items); > > I don't understand why this block got removed and nothing new is > adding anything to the tid store. > > @@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel) > * with prunestate-driven visibility map and > FSM steps (just like > * the two-pass strategy). 
> */ > - Assert(dead_items->num_items == 0); > + Assert(TidStoreNumTids(dead_items) == 0); > + } > + else if (prunestate.num_offsets > 0) > + { > + /* Save details of the LP_DEAD items from the > page in dead_items */ > + TidStoreSetBlockOffsets(dead_items, blkno, > prunestate.deadoffsets, > + > prunestate.num_offsets); > + > + > pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES, > + > TidStoreMemoryUsage(dead_items)); > > I guess it was added here, 800 lines away? If so, why? The above changes are related. The idea is not to use tidstore in a one-pass strategy. If the table doesn't have any indexes, in lazy_scan_prune() we collect offset numbers of dead tuples on the page and vacuum the page using them. In this case, we don't need to use tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The LVPagePruneState is a convenient place to store collected offset numbers. > > About progress reporting: I want to make sure no one is going to miss > counting "num_dead_tuples". It's no longer relevant for the number of > index scans we need to do, but do admins still have a use for it? > Something to think about later. I'm not sure if the user will still need num_dead_tuples in progress reporting view. The total number of dead tuples might be useful but the verbose log already shows that. > > 0017 > > + /* > + * max_bytes is forced to be at least 64kB, the current minimum valid > + * value for the work_mem GUC. > + */ > + max_bytes = Max(64 * 1024L, max_bytes); > > If this still needs to be here, I still don't understand why. Removed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
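To restate the two code paths described above as one schematic fragment (not compilable on its own; the function and field names follow the hunks quoted earlier in the thread, while local variable names such as buf and vmbuffer are assumptions):

/* schematic only: the real lazy_scan_heap() has many more steps */
if (prunestate.num_offsets > 0)
{
    if (vacrel->nindexes == 0)
    {
        /* one-pass: no indexes, so reap the LP_DEAD items right away */
        lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
                              prunestate.num_offsets, buf, vmbuffer);
    }
    else
    {
        /* two-pass: remember the offsets for index vacuuming later */
        TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
                                prunestate.num_offsets);
        pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                     TidStoreMemoryUsage(dead_items));
    }
}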
On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, > > + int num_offsets) > > +{ > > + char buf[MaxBlocktableEntrySize]; > > + BlocktableEntry *page = (BlocktableEntry *) buf; > > > > I'm not sure this is safe with alignment. Maybe rather than plain > > "char", it needs to be a union with BlocktableEntry, or something. > > I tried it in the new patch set but could you explain why it could not > be safe with alignment? I was thinking because "buf" is just an array of bytes. But, since the next declaration is a cast to a pointer to the actual type, maybe we can rely on the compiler to do the right thing. (It seems to on my machine in any case) > > About separation of responsibilities for locking: The only thing > > currently where the tid store is not locked is tree iteration. That's > > a strange exception. Also, we've recently made RT_FIND return a > > pointer, so the caller must somehow hold a share lock, but I think we > > haven't exposed callers the ability to do that, and we rely on the tid > > store lock for that. We have a mix of tree locking and tid store > > locking. We will need to consider carefully how to make this more > > clear, maintainable, and understandable. > > Yes, tidstore should be locked during the iteration. > > One simple direction about locking is that the radix tree has the lock > but no APIs hold/release it. It's the caller's responsibility. If a > data structure using a radix tree for its storage has its own lock > (like tidstore), it can use it instead of the radix tree's one. A It looks like the only reason tidstore has its own lock is because it has no way to delegate locking to the tree's lock. Instead of working around the limitations of the thing we've designed, let's make it work for the one use case we have. I think we need to expose RT_LOCK_* functions to the outside, and have tid store use them. That would allow us to simplify all those "if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" calls, which are complex and often redundant. At some point, we'll probably want to keep locking inside, at least to smooth the way for fine-grained locking you mentioned. > > In a green field, it'd be fine to replace these with an expression of > > "num_offsets", but it adds a bit of noise for reviewers and the git > > log. Is it really necessary? > > I see your point. I think we can live with having both > has_lpdead_items and num_offsets. But we will have to check if these > values are consistent, which could be less maintainable. It would be clearer if that removal was split out into a separate patch. > > I'm also not quite sure why "deadoffsets" and "lpdead_items" got > > moved to the PruneState. The latter was renamed in a way that makes > > more sense, but I don't see why the churn is necessary. ... > > I guess it was added here, 800 lines away? If so, why? > > The above changes are related. The idea is not to use tidstore in a > one-pass strategy. If the table doesn't have any indexes, in > lazy_scan_prune() we collect offset numbers of dead tuples on the page > and vacuum the page using them. In this case, we don't need to use > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The > LVPagePruneState is a convenient place to store collected offset > numbers. 
Okay, that makes sense, but if it was ever explained, I don't remember, and there is nothing in the commit message either. I'm not sure this can be split up easily, but if so it might help reviewing.

This change also leads to a weird-looking control flow:

if (vacrel->nindexes == 0)
{
    if (prunestate.num_offsets > 0)
    {
        ...
    }
}
else if (prunestate.num_offsets > 0)
{
    ...
}
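On the alignment question at the top of the message above: a union is the usual way to get a stack buffer that is formally aligned for the struct it will hold. A standalone sketch, with simplified stand-ins for BlocktableEntry and MaxBlocktableEntrySize (the real definitions live in the patch's tidstore code):

#include <stdint.h>

/* simplified stand-ins for the patch's types; illustration only */
typedef struct BlocktableEntry
{
    uint16_t    nwords;
    uint64_t    words[1];       /* variable length in the real structure */
} BlocktableEntry;

#define MaxBlocktableEntrySize 256      /* placeholder upper bound */

void
build_entry_on_stack(void)
{
    /*
     * A bare "char buf[MaxBlocktableEntrySize]" carries no alignment
     * guarantee for BlocktableEntry; the union below does, so the cast
     * is safe by the letter of the standard, not just in practice.
     */
    union
    {
        BlocktableEntry force_align;
        char        data[MaxBlocktableEntrySize];
    }           buf;
    BlocktableEntry *page = (BlocktableEntry *) buf.data;

    page->nwords = 0;
    /* ... fill in page->words[] before handing the entry to the tree ... */
}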
On Wed, Dec 27, 2023 at 12:08 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Dec 21, 2023 at 4:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, > > > + int num_offsets) > > > +{ > > > + char buf[MaxBlocktableEntrySize]; > > > + BlocktableEntry *page = (BlocktableEntry *) buf; > > > > > > I'm not sure this is safe with alignment. Maybe rather than plain > > > "char", it needs to be a union with BlocktableEntry, or something. > > > > I tried it in the new patch set but could you explain why it could not > > be safe with alignment? > > I was thinking because "buf" is just an array of bytes. But, since the > next declaration is a cast to a pointer to the actual type, maybe we > can rely on the compiler to do the right thing. (It seems to on my > machine in any case) Okay, I kept it. > > > > About separation of responsibilities for locking: The only thing > > > currently where the tid store is not locked is tree iteration. That's > > > a strange exception. Also, we've recently made RT_FIND return a > > > pointer, so the caller must somehow hold a share lock, but I think we > > > haven't exposed callers the ability to do that, and we rely on the tid > > > store lock for that. We have a mix of tree locking and tid store > > > locking. We will need to consider carefully how to make this more > > > clear, maintainable, and understandable. > > > > Yes, tidstore should be locked during the iteration. > > > > One simple direction about locking is that the radix tree has the lock > > but no APIs hold/release it. It's the caller's responsibility. If a > > data structure using a radix tree for its storage has its own lock > > (like tidstore), it can use it instead of the radix tree's one. A > > It looks like the only reason tidstore has its own lock is because it > has no way to delegate locking to the tree's lock. Instead of working > around the limitations of the thing we've designed, let's make it work > for the one use case we have. I think we need to expose RT_LOCK_* > functions to the outside, and have tid store use them. That would > allow us to simplify all those "if (TidStoreIsShared(ts) > LWLockAcquire(..., ...)" calls, which are complex and often redundant. I agree that we expose RT_LOCK_* functions and have tidstore use them, but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" calls part. I think that even if we expose them, we will still need to do something like "if (TidStoreIsShared(ts)) shared_rt_lock_share(ts->tree.shared)", no? > > At some point, we'll probably want to keep locking inside, at least to > smooth the way for fine-grained locking you mentioned. > > > > In a green field, it'd be fine to replace these with an expression of > > > "num_offsets", but it adds a bit of noise for reviewers and the git > > > log. Is it really necessary? > > > > I see your point. I think we can live with having both > > has_lpdead_items and num_offsets. But we will have to check if these > > values are consistent, which could be less maintainable. > > It would be clearer if that removal was split out into a separate patch. Agreed. > > > > I'm also not quite sure why "deadoffsets" and "lpdead_items" got > > > moved to the PruneState. The latter was renamed in a way that makes > > > more sense, but I don't see why the churn is necessary. > ... > > > I guess it was added here, 800 lines away? 
If so, why? > > > > The above changes are related. The idea is not to use tidstore in a > > one-pass strategy. If the table doesn't have any indexes, in > > lazy_scan_prune() we collect offset numbers of dead tuples on the page > > and vacuum the page using them. In this case, we don't need to use > > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The > > LVPagePruneState is a convenient place to store collected offset > > numbers. > > Okay, that makes sense, but if it was ever explained, I don't > remember, and there is nothing in the commit message either. > > I'm not sure this can be split up easily, but if so it might help reviewing. Agreed. > > This change also leads to a weird-looking control flow: > > if (vacrel->nindexes == 0) > { > if (prunestate.num_offsets > 0) > { > ... > } > } > else if (prunestate.num_offsets > 0) > { > ... > } Fixed. I've attached a new patch set. From v47 patch, I've merged your changes for radix tree, and split the vacuum integration patch into 3 patches: simply replaces VacDeadItems with TidsTore (0007 patch), and use a simple TID array for one-pass strategy (0008 patch), and replace has_lpdead_items with "num_offsets > 0" (0009 patch), while incorporating your review comments on the vacuum integration patch (sorry for making it difficult to see the changes from v47 patch). 0013 to 0015 patches are also updates from v47 patch. I'm thinking that we should change the order of the patches so that tidstore patch requires the patch for changing DSA segment sizes. That way, we can remove the complex max memory calculation part that we no longer use from the tidstore patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I agree that we expose RT_LOCK_* functions and have tidstore use them, > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > calls part. I think that even if we expose them, we will still need to > do something like "if (TidStoreIsShared(ts)) > shared_rt_lock_share(ts->tree.shared)", no? I'll come back to this topic separately. > I've attached a new patch set. From v47 patch, I've merged your > changes for radix tree, and split the vacuum integration patch into 3 > patches: simply replaces VacDeadItems with TidsTore (0007 patch), and > use a simple TID array for one-pass strategy (0008 patch), and replace > has_lpdead_items with "num_offsets > 0" (0009 patch), while > incorporating your review comments on the vacuum integration patch Nice! > (sorry for making it difficult to see the changes from v47 patch). It's actually pretty clear. I just have a couple comments before sharing my latest cleanups: (diff'ing between v47 and v48): -- /* - * In the shared case, TidStoreControl and radix_tree are backed by the - * same DSA area and rt_memory_usage() returns the value including both. - * So we don't need to add the size of TidStoreControl separately. - */ if (TidStoreIsShared(ts)) - return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared); + rt_mem = shared_rt_memory_usage(ts->tree.shared); + else + rt_mem = local_rt_memory_usage(ts->tree.local); - return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local); + return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem; Upthread, I meant that I don't see the need to include the size of these structs *at all*. They're tiny, and the blocks/segments will almost certainly have some empty space counted in the total anyway. The returned size is already overestimated, so this extra code is just a distraction. - if (result->num_offsets + bmw_popcount(w) > result->max_offset) + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD) >= result->max_offset) I believe this math is wrong. We care about "result->num_offsets + BITS_PER_BITMAPWORD", right? Also, it seems if the condition evaluates to equal, we still have enough space, in which case ">" the max is the right condition. - if (off < 1 || off > MAX_TUPLES_PER_PAGE) + if (off < 1 || off > MaxOffsetNumber) This can now use OffsetNumberIsValid(). > 0013 to 0015 patches are also updates from v47 patch. > I'm thinking that we should change the order of the patches so that > tidstore patch requires the patch for changing DSA segment sizes. That > way, we can remove the complex max memory calculation part that we no > longer use from the tidstore patch. I don't think there is any reason to have those calculations at all at this point. Every patch in every version should at least *work correctly*, without kludging m_w_m and without constraining max segment size. I'm fine with the latter remaining in its own thread, and I hope we can consider it an enhancement that respects the admin's configured limits more effectively, and not a pre-requisite for not breaking. I *think* we're there now, but it's hard to tell since 0015 was at the very end. As I said recently, if something still fails, I'd like to know why. So for v49, I took the liberty of removing the DSA max segment patches for now, and squashing v48-0015. 
In addition for v49, I have quite a few cleanups: 0001 - This hasn't been touched in a very long time, but I ran pgindent and clarified a comment 0002 - We no longer need to isolate the rightmost bit anywhere, so removed that part and revised the commit message accordingly. radix tree: 0003 - v48 plus squashed v48-0013 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated, and I fixed most of the rest. 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore. 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A bit simpler. 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE. 0008 - Removes a holdover from the multi-value leaves era. 0009 - It occurred to me that we need to have unique names for memory contexts for different instantiations of the template. This is one way to do it, by using the configured RT_PREFIX in the context name. I also took an extra step to make the size class fanout show up correctly on different platforms, but that's probably overkill and undesirable, and I'll probably use only the class name next time. 0010/11 - Make the array functions less surprising and with more informative names. 0012 - Restore a useful technique from Andres's prototype. This part has been slow for a long time, so much that it showed up in a profile where this path wasn't even taken much. tid store / vacuum: 0013/14 - Same as v48 TID store, with review squashed 0015 - Rationalize comment and starting value. 0016 - I applied the removal of the old clamps from v48-0011 (init/max DSA), and left out the rest for now. 0017-20 - Vacuum and debug tidstore as in v48, with v48-0015 squashed I'll bring up locking again shortly.
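To make the capacity arithmetic from earlier in this message concrete, here is a standalone sketch of the grow-then-decode step for one bitmapword. Types, names, and the bit-to-offset mapping are simplified stand-ins (the patch uses repalloc and its own bitmapword definition, and OffsetNumber 0 is never actually set):

#include <stdint.h>
#include <stdlib.h>

typedef uint64_t bitmapword;            /* stand-in; width varies in PG */
typedef uint16_t OffsetNumber;

#define BITS_PER_BITMAPWORD 64

typedef struct IterOutput
{
    OffsetNumber *offsets;
    int         num_offsets;
    int         max_offset;     /* allocated capacity of offsets[] */
} IterOutput;

/* Decode one bitmapword into offsets, growing the array only if needed. */
static void
decode_word(IterOutput *out, bitmapword w, int wordnum)
{
    /*
     * Worst case, this word adds BITS_PER_BITMAPWORD offsets, so grow only
     * when even the worst case would not fit; no popcount required.  One
     * doubling is always enough because the initial capacity (128 in the
     * discussion above) is at least BITS_PER_BITMAPWORD.
     */
    if (out->num_offsets + BITS_PER_BITMAPWORD > out->max_offset)
    {
        out->max_offset *= 2;
        out->offsets = realloc(out->offsets,
                               sizeof(OffsetNumber) * out->max_offset);
    }

    for (int bit = 0; bit < BITS_PER_BITMAPWORD; bit++)
    {
        if (w & ((bitmapword) 1 << bit))
            out->offsets[out->num_offsets++] =
                (OffsetNumber) (wordnum * BITS_PER_BITMAPWORD + bit);
    }
}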
On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > calls part. I think that even if we expose them, we will still need to > > do something like "if (TidStoreIsShared(ts)) > > shared_rt_lock_share(ts->tree.shared)", no? > > I'll come back to this topic separately. To answer your question, sure, but that "if (TidStoreIsShared(ts))" part would be pushed down into a function so that only one place has to care about it. However, I'm starting to question whether we even need that. Meaning, lock the tidstore separately. To "lock the tidstore" means to take a lock, _separate_ from the radix tree's internal lock, to control access to two fields in a separate "control object": +typedef struct TidStoreControl +{ + /* the number of tids in the store */ + int64 num_tids; + + /* the maximum bytes a TidStore can use */ + size_t max_bytes; I'm pretty sure max_bytes does not need to be in shared memory, and certainly not under a lock: Thinking of a hypothetical parallel-prune-phase scenario, one way would be for a leader process to pass out ranges of blocks to workers, and when the limit is exceeded, stop passing out blocks and wait for all the workers to finish. As for num_tids, vacuum previously put the similar count in @@ -176,7 +179,8 @@ struct ParallelVacuumState PVIndStats *indstats; /* Shared dead items space among parallel vacuum workers */ - VacDeadItems *dead_items; + TidStore *dead_items; VacDeadItems contained "num_items". What was the reason to have new infrastructure for that count? And it doesn't seem like access to it was controlled by a lock -- can you confirm? If we did get parallel pruning, maybe the count would belong inside PVShared? The number of tids is not that tightly bound to the tidstore's job. I believe tidbitmap.c (a possible future client) doesn't care about the global number of tids -- not only that, but AND/OR operations can change the number in a non-obvious way, so it would not be convenient to keep an accurate number anyway. But the lock would still be mandatory with this patch. If we can make vacuum work a bit closer to how it does now, it'd be a big step up in readability, I think. Namely, getting rid of all the locking logic inside tidstore.c and let the radix tree's locking do the right thing. We'd need to make that work correctly when receiving pointers to values upon lookup, and I already shared ideas for that. But I want to see if there is any obstacle in the way of removing the tidstore control object and it's separate lock.
On Wed, Jan 3, 2024 at 11:10 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > calls part. I think that even if we expose them, we will still need to > > do something like "if (TidStoreIsShared(ts)) > > shared_rt_lock_share(ts->tree.shared)", no? > > I'll come back to this topic separately. > > > I've attached a new patch set. From v47 patch, I've merged your > > changes for radix tree, and split the vacuum integration patch into 3 > > patches: simply replaces VacDeadItems with TidsTore (0007 patch), and > > use a simple TID array for one-pass strategy (0008 patch), and replace > > has_lpdead_items with "num_offsets > 0" (0009 patch), while > > incorporating your review comments on the vacuum integration patch > > Nice! > > > (sorry for making it difficult to see the changes from v47 patch). > > It's actually pretty clear. I just have a couple comments before > sharing my latest cleanups: > > (diff'ing between v47 and v48): > > -- /* > - * In the shared case, TidStoreControl and radix_tree are backed by the > - * same DSA area and rt_memory_usage() returns the value including both. > - * So we don't need to add the size of TidStoreControl separately. > - */ > if (TidStoreIsShared(ts)) > - return sizeof(TidStore) + > shared_rt_memory_usage(ts->tree.shared); > + rt_mem = shared_rt_memory_usage(ts->tree.shared); > + else > + rt_mem = local_rt_memory_usage(ts->tree.local); > > - return sizeof(TidStore) + sizeof(TidStore) + > local_rt_memory_usage(ts->tree.local); > + return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem; > > Upthread, I meant that I don't see the need to include the size of > these structs *at all*. They're tiny, and the blocks/segments will > almost certainly have some empty space counted in the total anyway. > The returned size is already overestimated, so this extra code is just > a distraction. Agreed. > > - if (result->num_offsets + bmw_popcount(w) > result->max_offset) > + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD) > >= result->max_offset) > > I believe this math is wrong. We care about "result->num_offsets + > BITS_PER_BITMAPWORD", right? > Also, it seems if the condition evaluates to equal, we still have > enough space, in which case ">" the max is the right condition. Oops, you're right. Fixed. > > - if (off < 1 || off > MAX_TUPLES_PER_PAGE) > + if (off < 1 || off > MaxOffsetNumber) > > This can now use OffsetNumberIsValid(). Fixed. > > > 0013 to 0015 patches are also updates from v47 patch. > > > I'm thinking that we should change the order of the patches so that > > tidstore patch requires the patch for changing DSA segment sizes. That > > way, we can remove the complex max memory calculation part that we no > > longer use from the tidstore patch. > > I don't think there is any reason to have those calculations at all at > this point. Every patch in every version should at least *work > correctly*, without kludging m_w_m and without constraining max > segment size. I'm fine with the latter remaining in its own thread, > and I hope we can consider it an enhancement that respects the admin's > configured limits more effectively, and not a pre-requisite for not > breaking. I *think* we're there now, but it's hard to tell since 0015 > was at the very end. 
As I said recently, if something still fails, I'd > like to know why. So for v49, I took the liberty of removing the DSA > max segment patches for now, and squashing v48-0015. Fair enough. > > In addition for v49, I have quite a few cleanups: > > 0001 - This hasn't been touched in a very long time, but I ran > pgindent and clarified a comment > 0002 - We no longer need to isolate the rightmost bit anywhere, so > removed that part and revised the commit message accordingly. Thanks. > > radix tree: > 0003 - v48 plus squashed v48-0013 > 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated, > and I fixed most of the rest. > 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore. > 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A > bit simpler. > 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE. > 0008 - Removes a holdover from the multi-value leaves era. > 0009 - It occurred to me that we need to have unique names for memory > contexts for different instantiations of the template. This is one way > to do it, by using the configured RT_PREFIX in the context name. I > also took an extra step to make the size class fanout show up > correctly on different platforms, but that's probably overkill and > undesirable, and I'll probably use only the class name next time. > 0010/11 - Make the array functions less surprising and with more > informative names. > 0012 - Restore a useful technique from Andres's prototype. This part > has been slow for a long time, so much that it showed up in a profile > where this path wasn't even taken much. These changes look good to me. I've squashed them. In addition, I've made some changes and cleanups: 0010 - address the above review comments. 0011 - simplify the radix tree iteration code. I hope it makes the code clear and readable. Also I removed RT_UPDATE_ITER_STACK(). 0012 - fix a typo 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes even if SIZEOF_VOID_P is 8 bytes, for example in a case where !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for details. 0014 - cleanup RT_VERIFY code. 0015 - change and cleanup RT_DUMP_NODE(). Now it dumps only one node and no longer supports dumping nodes recursively. 0016 - remove RT_DUMP_SEARCH() and RT_DUMP(). These seem no longer necessary. 0017 - MOve RT_DUMP_NODE to the debug function section, close to RT_STATS. 0018 - Fix a printf format in RT_STATS(). BTW, now that the inner and leaf nodes use the same structure, do we still need RT_NODE_BASE_XXX types? Most places where we use RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types. Exceptions are RT_FANOUT_XX calculations: #if SIZEOF_VOID_P < 8 #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) #define RT_FANOUT_48 ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) #else #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) #define RT_FANOUT_48 ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) #endif /* SIZEOF_VOID_P < 8 */ But I think we can replace them with offsetof(RT_NODE_16, children) etc. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > In addition, I've made some changes and cleanups: These look good to me, although I have not tried dumping a node in a while. > 0011 - simplify the radix tree iteration code. I hope it makes the > code clear and readable. Also I removed RT_UPDATE_ITER_STACK(). I'm very pleased with how much simpler it is now! > 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for > RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because > DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes > even if SIZEOF_VOID_P is 8 bytes, for example in a case where > !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for > details. Thanks for the pointer. ;-) > BTW, now that the inner and leaf nodes use the same structure, do we > still need RT_NODE_BASE_XXX types? Most places where we use > RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types. That's been in the back of my mind as well. Maybe the common header should be the new "base" member? At least, something other than "n". > Exceptions are RT_FANOUT_XX calculations: > > #if SIZEOF_VOID_P < 8 > #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) > #define RT_FANOUT_48 ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) > #else > #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) > #define RT_FANOUT_48 ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) > #endif /* SIZEOF_VOID_P < 8 */ > > But I think we can replace them with offsetof(RT_NODE_16, children) etc. That makes sense. Do you want to have a go at it, or shall I? I think after that, the only big cleanup needed is putting things in a more readable order. I can do that at a later date, and other opportunities for beautification are pretty minor and localized. Rationalizing locking is the only thing left that requires a bit of thought.
On Tue, Jan 9, 2024 at 8:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > In addition, I've made some changes and cleanups: > > These look good to me, although I have not tried dumping a node in a while. > > > 0011 - simplify the radix tree iteration code. I hope it makes the > > code clear and readable. Also I removed RT_UPDATE_ITER_STACK(). > > I'm very pleased with how much simpler it is now! > > > 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for > > RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because > > DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes > > even if SIZEOF_VOID_P is 8 bytes, for example in a case where > > !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for > > details. > > Thanks for the pointer. ;-) > > > BTW, now that the inner and leaf nodes use the same structure, do we > > still need RT_NODE_BASE_XXX types? Most places where we use > > RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types. > > That's been in the back of my mind as well. Maybe the common header > should be the new "base" member? At least, something other than "n". Agreed. > > > Exceptions are RT_FANOUT_XX calculations: > > > > #if SIZEOF_VOID_P < 8 > > #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) > > #define RT_FANOUT_48 ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) > > #else > > #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC)) > > #define RT_FANOUT_48 ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC)) > > #endif /* SIZEOF_VOID_P < 8 */ > > > > But I think we can replace them with offsetof(RT_NODE_16, children) etc. > > That makes sense. Do you want to have a go at it, or shall I? I've done in 0010 patch in v51 patch set. Whereas RT_NODE_4 and RT_NODE_16 structs declaration needs RT_FANOUT_4_HI and RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need RT_NODE_16 and RT_NODE_48 structs declaration. So fanout declarations are now spread before and after RT_NODE_XXX struct declaration. It's a bit less readable, but I'm not sure of a better way. The previous updates are merged into the main radix tree patch and tidstore patch. Nothing changes in other patches from v50. > > I think after that, the only big cleanup needed is putting things in a > more readable order. I can do that at a later date, and other > opportunities for beautification are pretty minor and localized. Agreed. > > Rationalizing locking is the only thing left that requires a bit of thought. Right, I'll send a reply soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Jan 10, 2024 at 9:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've done in 0010 patch in v51 patch set. Whereas RT_NODE_4 and > RT_NODE_16 structs declaration needs RT_FANOUT_4_HI and > RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need > RT_NODE_16 and RT_NODE_48 structs declaration. So fanout declarations > are now spread before and after RT_NODE_XXX struct declaration. It's a > bit less readable, but I'm not sure of a better way. They were before and after the *_BASE types, so it's not really worse, I think. I did notice that RT_SLOT_IDX_LIMIT has been considered special for a very long time, before we even had size classes, so it's the same thing but even farther away. I have an idea to introduce *_MAX macros, allowing us to turn RT_SLOT_IDX_LIMIT into RT_FANOUT_48_MAX, so that everything is in the same spot, and to make this area more consistent. I also noticed that I'd been assuming that RT_FANOUT_16_HI fits easily into a DSA size class, but that's only true on 64-bit, and in any case we don't want to assume it. I've attached an addendum .txt to demo this idea.
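As a compilable illustration of the offsetof() approach discussed above: the fanout for a given allocation size can be computed from the node type itself, with no separate *_BASE struct. The node layout and the NODE16_FANOUT_FOR macro here are made-up stand-ins, not the patch's real RT_NODE_16.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t RT_PTR_ALLOC;

typedef struct RT_NODE_16
{
    uint8_t     kind;
    uint8_t     count;
    uint8_t     chunks[16];
    RT_PTR_ALLOC children[];    /* flexible array member */
} RT_NODE_16;

/*
 * Derive how many child slots fit in a given allocation size, using
 * offsetof() on the node type itself.
 */
#define NODE16_FANOUT_FOR(alloc_size) \
    (((alloc_size) - offsetof(RT_NODE_16, children)) / sizeof(RT_PTR_ALLOC))

int
main(void)
{
    printf("child slots in a 160-byte node16: %zu\n",
           (size_t) NODE16_FANOUT_FOR(160));
    return 0;
}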
On Mon, Jan 8, 2024 at 8:35 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > calls part. I think that even if we expose them, we will still need to > > > do something like "if (TidStoreIsShared(ts)) > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > I'll come back to this topic separately. > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > part would be pushed down into a function so that only one place has > to care about it. > > However, I'm starting to question whether we even need that. Meaning, > lock the tidstore separately. To "lock the tidstore" means to take a > lock, _separate_ from the radix tree's internal lock, to control > access to two fields in a separate "control object": > > +typedef struct TidStoreControl > +{ > + /* the number of tids in the store */ > + int64 num_tids; > + > + /* the maximum bytes a TidStore can use */ > + size_t max_bytes; > > I'm pretty sure max_bytes does not need to be in shared memory, and > certainly not under a lock: Thinking of a hypothetical > parallel-prune-phase scenario, one way would be for a leader process > to pass out ranges of blocks to workers, and when the limit is > exceeded, stop passing out blocks and wait for all the workers to > finish. True. I agreed that it doesn't need to be under a lock anyway, as it's read-only. > > As for num_tids, vacuum previously put the similar count in > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > PVIndStats *indstats; > > /* Shared dead items space among parallel vacuum workers */ > - VacDeadItems *dead_items; > + TidStore *dead_items; > > VacDeadItems contained "num_items". What was the reason to have new > infrastructure for that count? And it doesn't seem like access to it > was controlled by a lock -- can you confirm? If we did get parallel > pruning, maybe the count would belong inside PVShared? I thought that since the tidstore is a general-purpose data structure the shared counter should be protected by a lock. One thing I'm concerned about is that we might need to update both the radix tree and the counter atomically in some cases. But that's true we don't need it for lazy vacuum at least for now. Even given the parallel scan phase, probably we won't need to have workers check the total number of stored tuples during a parallel scan. > > The number of tids is not that tightly bound to the tidstore's job. I > believe tidbitmap.c (a possible future client) doesn't care about the > global number of tids -- not only that, but AND/OR operations can > change the number in a non-obvious way, so it would not be convenient > to keep an accurate number anyway. But the lock would still be > mandatory with this patch. Very good point. > > If we can make vacuum work a bit closer to how it does now, it'd be a > big step up in readability, I think. Namely, getting rid of all the > locking logic inside tidstore.c and let the radix tree's locking do > the right thing. We'd need to make that work correctly when receiving > pointers to values upon lookup, and I already shared ideas for that. > But I want to see if there is any obstacle in the way of removing the > tidstore control object and it's separate lock. 
So I agree to remove both max_bytes and num_items from the control object. Also, as you mentioned, we can remove the tidstore control object itself. TidStoreGetHandle() returns a radix tree handle, and we can pass it to TidStoreAttach(). I'll try it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 8, 2024 at 8:35 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > > calls part. I think that even if we expose them, we will still need to > > > > do something like "if (TidStoreIsShared(ts)) > > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > > > I'll come back to this topic separately. > > > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > > part would be pushed down into a function so that only one place has > > to care about it. > > > > However, I'm starting to question whether we even need that. Meaning, > > lock the tidstore separately. To "lock the tidstore" means to take a > > lock, _separate_ from the radix tree's internal lock, to control > > access to two fields in a separate "control object": > > > > +typedef struct TidStoreControl > > +{ > > + /* the number of tids in the store */ > > + int64 num_tids; > > + > > + /* the maximum bytes a TidStore can use */ > > + size_t max_bytes; > > > > I'm pretty sure max_bytes does not need to be in shared memory, and > > certainly not under a lock: Thinking of a hypothetical > > parallel-prune-phase scenario, one way would be for a leader process > > to pass out ranges of blocks to workers, and when the limit is > > exceeded, stop passing out blocks and wait for all the workers to > > finish. > > True. I agreed that it doesn't need to be under a lock anyway, as it's > read-only. > > > > > As for num_tids, vacuum previously put the similar count in > > > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > > PVIndStats *indstats; > > > > /* Shared dead items space among parallel vacuum workers */ > > - VacDeadItems *dead_items; > > + TidStore *dead_items; > > > > VacDeadItems contained "num_items". What was the reason to have new > > infrastructure for that count? And it doesn't seem like access to it > > was controlled by a lock -- can you confirm? If we did get parallel > > pruning, maybe the count would belong inside PVShared? > > I thought that since the tidstore is a general-purpose data structure > the shared counter should be protected by a lock. One thing I'm > concerned about is that we might need to update both the radix tree > and the counter atomically in some cases. But that's true we don't > need it for lazy vacuum at least for now. Even given the parallel scan > phase, probably we won't need to have workers check the total number > of stored tuples during a parallel scan. > > > > > The number of tids is not that tightly bound to the tidstore's job. I > > believe tidbitmap.c (a possible future client) doesn't care about the > > global number of tids -- not only that, but AND/OR operations can > > change the number in a non-obvious way, so it would not be convenient > > to keep an accurate number anyway. But the lock would still be > > mandatory with this patch. > > Very good point. > > > > > If we can make vacuum work a bit closer to how it does now, it'd be a > > big step up in readability, I think. Namely, getting rid of all the > > locking logic inside tidstore.c and let the radix tree's locking do > > the right thing. 
We'd need to make that work correctly when receiving > > pointers to values upon lookup, and I already shared ideas for that. > > But I want to see if there is any obstacle in the way of removing the > > tidstore control object and it's separate lock. > > So I agree to remove both max_bytes and num_items from the control > object.Also, as you mentioned, we can remove the tidstore control > object itself. TidStoreGetHandle() returns a radix tree handle, and we > can pass it to TidStoreAttach(). I'll try it. > I realized that if we remove the whole tidstore control object including max_bytes, processes that attached to the shared tidstore cannot actually use TidStoreIsFull(), as it always returns true. They cannot use TidStoreReset() either, since it needs to pass max_bytes to RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it could be problematic for general use. If we remove it, we probably need a safeguard to prevent those who attached the tidstore from calling these functions. Or we can keep the control object but remove the lock and num_tids. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > So I agree to remove both max_bytes and num_items from the control > > object.Also, as you mentioned, we can remove the tidstore control > > object itself. TidStoreGetHandle() returns a radix tree handle, and we > > can pass it to TidStoreAttach(). I'll try it. Thanks. It's worth looking closely here. > I realized that if we remove the whole tidstore control object > including max_bytes, processes who attached the shared tidstore cannot > use TidStoreIsFull() actually as it always returns true. I imagine that we'd replace that with a function (maybe an earlier version had it?) to report the memory usage to the caller, which should know where to find max_bytes. > Also they > cannot use TidStoreReset() as well since it needs to pass max_bytes to > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it > could be problematic for general use. HEAD has no problem finding the necessary values, and I don't think it'd be difficult to maintain that ability. I'm not actually sure what "general use" needs to have, and I'm not sure anyone can guess. There's the future possibility of parallel heap-scanning, but I'm guessing a *lot* more needs to happen for that to work, so I'm not sure how much it buys us to immediately start putting those two fields in a special abstraction. The only other concrete use case mentioned in this thread that I remember is bitmap heap scan, and I believe that would never need to reset, only free the whole thing when finished. I spent some more time studying parallel vacuum, and have some thoughts. In HEAD, we have -/* - * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming. - */ -typedef struct VacDeadItems -{ - int max_items; /* # slots allocated in array */ - int num_items; /* current # of entries */ - - /* Sorted array of TIDs to delete from indexes */ - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER]; -} VacDeadItems; ...which has the tids, plus two fields that function _very similarly_ to the two extra fields in the tidstore control object. It's a bit strange to me that the patch doesn't have this struct anymore. I suspect if we keep it around (just change "items" to be the local tidstore struct), the patch would have a bit less churn and look/work more like the current code. I think it might be easier to read if the v17 commits are suited to the current needs of vacuum, rather than try to anticipate all uses. Richer abstractions can come later if needed. Another stanza: - /* Prepare the dead_items space */ - dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc, - est_dead_items_len); - dead_items->max_items = max_items; - dead_items->num_items = 0; - MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items); - shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items); - pvs->dead_items = dead_items; With s/max_items/max_bytes/, I wonder if we can still use some of this, and parallel workers would have no problem getting the necessary info, as they do today. If not, I don't really understand why. I'm not very familiar with working with shared memory, and I know the tree itself needs some different setup, so it's quite possible I'm missing something. 
I find it difficult to keep straight these four things:

- radix tree
- radix tree control object
- tidstore
- tidstore control object

Even with the code in front of me, it's hard to reason about how these concepts fit together. It'd be much more readable if this was simplified.
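As a concrete example of the direction suggested above (report usage and let the caller own the limit), a small hypothetical helper. TidStore and TidStoreMemoryUsage() are from the patch series; dead_items_full(), the header path, and the idea of keeping max_bytes in vacuum-side state are assumptions:

#include "postgres.h"
#include "access/tidstore.h"    /* header location assumed from the patch series */

/*
 * Hypothetical caller-side replacement for TidStoreIsFull(): vacuum keeps
 * max_bytes itself (e.g. in LVRelState or ParallelVacuumState) and only
 * asks the store for its current memory usage.
 */
static inline bool
dead_items_full(TidStore *dead_items, size_t max_bytes)
{
    return TidStoreMemoryUsage(dead_items) > max_bytes;
}

With something like this, neither max_bytes nor a tid count has to live in shared memory or be protected by the tidstore's own lock.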
On Sun, Jan 14, 2024 at 10:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > So I agree to remove both max_bytes and num_items from the control > > > object.Also, as you mentioned, we can remove the tidstore control > > > object itself. TidStoreGetHandle() returns a radix tree handle, and we > > > can pass it to TidStoreAttach(). I'll try it. > > Thanks. It's worth looking closely here. > > > I realized that if we remove the whole tidstore control object > > including max_bytes, processes who attached the shared tidstore cannot > > use TidStoreIsFull() actually as it always returns true. > > I imagine that we'd replace that with a function (maybe an earlier > version had it?) to report the memory usage to the caller, which > should know where to find max_bytes. > > > Also they > > cannot use TidStoreReset() as well since it needs to pass max_bytes to > > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it > > could be problematic for general use. > > HEAD has no problem finding the necessary values, and I don't think > it'd be difficult to maintain that ability. I'm not actually sure what > "general use" needs to have, and I'm not sure anyone can guess. > There's the future possibility of parallel heap-scanning, but I'm > guessing a *lot* more needs to happen for that to work, so I'm not > sure how much it buys us to immediately start putting those two fields > in a special abstraction. The only other concrete use case mentioned > in this thread that I remember is bitmap heap scan, and I believe that > would never need to reset, only free the whole thing when finished. > > I spent some more time studying parallel vacuum, and have some > thoughts. In HEAD, we have > > -/* > - * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming. > - */ > -typedef struct VacDeadItems > -{ > - int max_items; /* # slots allocated in array */ > - int num_items; /* current # of entries */ > - > - /* Sorted array of TIDs to delete from indexes */ > - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER]; > -} VacDeadItems; > > ...which has the tids, plus two fields that function _very similarly_ > to the two extra fields in the tidstore control object. It's a bit > strange to me that the patch doesn't have this struct anymore. > > I suspect if we keep it around (just change "items" to be the local > tidstore struct), the patch would have a bit less churn and look/work > more like the current code. I think it might be easier to read if the > v17 commits are suited to the current needs of vacuum, rather than try > to anticipate all uses. Richer abstractions can come later if needed. Just changing "items" to be the local tidstore struct could make the code tricky a bit, since max_bytes and num_items are on the shared memory while "items" is a local pointer to the shared tidstore. This is a reason why I abstract them behind TidStore. However, IIUC the current parallel vacuum can work with such VacDeadItems fields, fortunately. The leader process can use VacDeadItems allocated on DSM, and worker processes can use a local VacDeadItems of which max_bytes and num_items are copied from the shared one and "items" is a local pointer. Assuming parallel heap scan requires for both the leader and workers to update the shared VacDeadItems concurrently, we may need such richer abstractions. 
I've implemented this idea in the v52 patch set. Here is a summary of the updates:

0008: Remove the control object from tidstore. Also remove some functions that can no longer be supported, such as TidStoreNumTids().
0009: Adjust the lazy vacuum integration patch for the control object removal.

I've not updated any locking code yet. Once we confirm this direction, I'll update the locking code too.

Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Just changing "items" to be the local tidstore struct could make the > code tricky a bit, since max_bytes and num_items are on the shared > memory while "items" is a local pointer to the shared tidstore. Thanks for trying it this way! I like the overall simplification but this aspect is not great. Hmm, I wonder if that's a side-effect of the "create" functions doing their own allocations and returning a pointer. Would it be less tricky if the structs were declared where we need them and passed to "init" functions? That may be a good idea for other reasons. It's awkward that the create function is declared like this: #ifdef RT_SHMEM RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, dsa_area *dsa, int tranche_id); #else RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); #endif An init function wouldn't need these parameters: it could look at the passed struct to know what to do.
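For illustration, an init-style declaration could collapse to a single signature; this is a sketch only, and the RT_INIT name and parameter list are assumptions, not the template's actual API:

/* The caller declares the struct and, when built with RT_SHMEM, fills in
 * the dsa/tranche fields before calling init, so the declaration no longer
 * needs to branch on shared memory. */
RT_SCOPE void RT_INIT(RT_RADIX_TREE *tree, MemoryContext ctx, Size max_bytes);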
On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Just changing "items" to be the local tidstore struct could make the > > code tricky a bit, since max_bytes and num_items are on the shared > > memory while "items" is a local pointer to the shared tidstore. > > Thanks for trying it this way! I like the overall simplification but > this aspect is not great. > Hmm, I wonder if that's a side-effect of the "create" functions doing > their own allocations and returning a pointer. Would it be less tricky > if the structs were declared where we need them and passed to "init" > functions? Seems worth trying. That said, the current RT_CREATE() API is also convenient, as other data structures such as simplehash.h and dshash.c support a similar creation API. > > That may be a good idea for other reasons. It's awkward that the > create function is declared like this: > > #ifdef RT_SHMEM > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > dsa_area *dsa, > int tranche_id); > #else > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > #endif > > An init function wouldn't need these parameters: it could look at the > passed struct to know what to do. But the init function would still initialize leaf_ctx etc., no? Initializing leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE. The same is true for dsa. I imagined that an init function would allocate DSA memory for the control object. So I imagine we will still end up requiring some of those parameters. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Just changing "items" to be the local tidstore struct could make the > > > code tricky a bit, since max_bytes and num_items are on the shared > > > memory while "items" is a local pointer to the shared tidstore. > > > > Thanks for trying it this way! I like the overall simplification but > > this aspect is not great. > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > their own allocations and returning a pointer. Would it be less tricky > > if the structs were declared where we need them and passed to "init" > > functions? > > Seems worth trying. The current RT_CREATE() API is also convenient as > other data structure such as simplehash.h and dshash.c supports a > similar I don't happen to know if these paths had to solve similar trickiness with some values being local, and some shared. > > That may be a good idea for other reasons. It's awkward that the > > create function is declared like this: > > > > #ifdef RT_SHMEM > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > dsa_area *dsa, > > int tranche_id); > > #else > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > #endif > > > > An init function wouldn't need these parameters: it could look at the > > passed struct to know what to do. > > But the init function would initialize leaf_ctx etc,no? Initializing > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. I was more referring to the parameters that were different above depending on shared memory. My first thought was that the tricky part is because of the allocation in local memory, but it's certainly possible I've misunderstood the problem. > The same > is true for dsa. I imagined that an init function would allocate a DSA > memory for the control object. Yes: ... // embedded in VacDeadItems TidStore items; }; // NULL DSA in local case, etc dead_items->items.area = dead_items_dsa; dead_items->items.tranche_id = FOO_ID; TidStoreInit(&dead_items->items, vac_work_mem); That's how I imagined it would work (leaving out some details). I haven't tried it, so not sure how much it helps. Maybe it has other problems, but I'm hoping it's just a matter of programming. If we can't make this work nicely, I'd be okay with keeping the tid store control object. My biggest concern is unnecessary double-locking.
I wrote: > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > their own allocations and returning a pointer. Would it be less tricky > > if the structs were declared where we need them and passed to "init" > > functions? If this is a possibility, I thought I'd first send the last (I hope) large-ish set of radix tree cleanups to avoid rebasing issues. I'm not including tidstore/vacuum here, because recent discussion has some up-in-the-air work. Should be self-explanatory, but some thing are worth calling out: 0012 and 0013: Some time ago I started passing insertpos as a parameter, but now see that is not ideal -- when growing from node16 to node48 we don't need it at all, so it's a wasted calculation. While reverting that, I found that this also allows passing constants in some cases. 0014 makes a cleaner separation between adding a child and growing a node, resulting in more compact-looking functions. 0019 is a bit unpolished, but I realized that it's pointless to assign a zero child when further up the call stack we overwrite it anyway with the actual value. With this, that assignment is skipped. This makes some comments and names strange, so needs a bit of polish, but wanted to get it out there anyway.
Attachment
On Wed, Jan 17, 2024 at 11:37 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Just changing "items" to be the local tidstore struct could make the > > > > code tricky a bit, since max_bytes and num_items are on the shared > > > > memory while "items" is a local pointer to the shared tidstore. > > > > > > Thanks for trying it this way! I like the overall simplification but > > > this aspect is not great. > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > > > Seems worth trying. The current RT_CREATE() API is also convenient as > > other data structure such as simplehash.h and dshash.c supports a > > similar > > I don't happen to know if these paths had to solve similar trickiness > with some values being local, and some shared. > > > > That may be a good idea for other reasons. It's awkward that the > > > create function is declared like this: > > > > > > #ifdef RT_SHMEM > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > > dsa_area *dsa, > > > int tranche_id); > > > #else > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > > #endif > > > > > > An init function wouldn't need these parameters: it could look at the > > > passed struct to know what to do. > > > > But the init function would initialize leaf_ctx etc,no? Initializing > > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. > > I was more referring to the parameters that were different above > depending on shared memory. My first thought was that the tricky part > is because of the allocation in local memory, but it's certainly > possible I've misunderstood the problem. > > > The same > > is true for dsa. I imagined that an init function would allocate a DSA > > memory for the control object. > > Yes: > > ... > // embedded in VacDeadItems > TidStore items; > }; > > // NULL DSA in local case, etc > dead_items->items.area = dead_items_dsa; > dead_items->items.tranche_id = FOO_ID; > > TidStoreInit(&dead_items->items, vac_work_mem); > > That's how I imagined it would work (leaving out some details). I > haven't tried it, so not sure how much it helps. Maybe it has other > problems, but I'm hoping it's just a matter of programming. It seems we cannot make this work nicely. IIUC VacDeadItems is allocated in DSM and TidStore is embedded there. However, dead_items->items.area is a local pointer to dsa_area. So we cannot include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we would need to pass dsa_area to each interface by callers. > > If we can't make this work nicely, I'd be okay with keeping the tid > store control object. My biggest concern is unnecessary > double-locking. If we don't do any locking stuff in radix tree APIs and it's the user's responsibility at all, probably we don't need a lock for tidstore? That is, we expose lock functions as you mentioned and the user (like tidstore) acquires/releases the lock before/after accessing the radix tree and num_items. Currently (as of v52 patch) RT_FIND is doing so, but we would need to change RT_SET() and iteration functions as well. 
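A sketch of the caller-side pattern described above, with illustrative names standing in for the exposed lock functions and the tidstore internals:

/* The radix tree takes no locks internally; tidstore brackets its access
 * with the exposed lock functions and updates its counter under the same
 * lock. All names here are placeholders. */
shared_ts_lock_exclusive(ts->tree);         /* e.g. an exposed RT_LOCK_EXCLUSIVE */
shared_ts_set(ts->tree, (uint64) blkno, entry);
ts->control->num_tids += num_offsets;       /* counter protected by the same lock */
shared_ts_unlock(ts->tree);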
While trying this idea, I realized that there is a visibility problem in the radix tree template, especially if we want to embed the radix tree in a struct. Considering a use case where we want to use a radix tree in an exposed struct, we would declare only interfaces in a .h file and define the actual implementation in a .c file (FYI TupleHashTableData does a similar thing with simplehash.h). The .h file and .c file would look like this:

in .h file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DECLARE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

typedef struct TidStore
{
    :
    local_rt_radix_tree tree; /* embedded */
    :
} TidStore;

in .c file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

But this doesn't work, as the compiler doesn't know the actual definition of local_rt_radix_tree. If 'tree' is a *local_rt_radix_tree instead, it works. The reason is that with RT_DECLARE but without RT_DEFINE, the radix tree template generates only forward declarations:

#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

To make it work, we would need to move the definitions required to expose the RT_RADIX_TREE struct into the RT_DECLARE part, which in turn requires moving RT_NODE, RT_HANDLE, RT_NODE_PTR, RT_SIZE_CLASS_COUNT, RT_RADIX_TREE_CONTROL, etc. However, RT_SIZE_CLASS_COUNT, used in RT_RADIX_TREE, could be bothersome: since it refers to RT_SIZE_CLASS_INFO, which in turn refers to many #defines and structs, we might end up moving many structs such as RT_NODE_4 into the RT_DECLARE part as well. Or we could use a fixed number instead of "lengthof(RT_SIZE_CLASS_INFO)". Apart from that, macros required by both RT_DECLARE and RT_DEFINE, such as RT_PAN and RT_MAX_LEVEL, also need to be moved to a common place where they are defined in both cases. Given these facts, I think that the current abstraction works nicely and it would make sense not to support embedding the radix tree. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
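The underlying C rule is the usual restriction on incomplete types, independent of the template; a tiny standalone example:

/* A forward declaration permits pointers to the type but not by-value
 * embedding, because the compiler doesn't know the type's size. */
typedef struct opaque_tree opaque_tree;     /* declared, never defined here */

struct uses_pointer { opaque_tree *tree; }; /* OK */
struct embeds_value { opaque_tree tree; };  /* error: field has incomplete type */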
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > It seems we cannot make this work nicely. IIUC VacDeadItems is > allocated in DSM and TidStore is embedded there. However, > dead_items->items.area is a local pointer to dsa_area. So we cannot > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > would need to pass dsa_area to each interface by callers. Thanks again for exploring this line of thinking! Okay, it seems even if there's a way to make this work, it would be too invasive to justify when compared with the advantage I was hoping for. > > If we can't make this work nicely, I'd be okay with keeping the tid > > store control object. My biggest concern is unnecessary > > double-locking. > > If we don't do any locking stuff in radix tree APIs and it's the > user's responsibility at all, probably we don't need a lock for > tidstore? That is, we expose lock functions as you mentioned and the > user (like tidstore) acquires/releases the lock before/after accessing > the radix tree and num_items. I'm not quite sure what the point of "num_items" is anymore, because it was really tied to the array in VacDeadItems. dead_items->num_items is essential to reading/writing the array correctly. If this number is wrong, the array is corrupt. There is no such requirement for the radix tree. We don't need to know the number of tids to add to it or do a lookup, or anything. There are a number of places where we assert "the running count of the dead items" is the same as "the length of the dead items array", like here: @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) BlockNumber threshold; Assert(vacrel->num_index_scans == 0); - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); As such, in HEAD I'm guessing it's arbitrary which one is used for control flow. Correct me if I'm mistaken. If I am wrong for some part of the code, it'd be good to understand when that invariant can't be maintained. @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); Like here. In HEAD, could this have used vacrel->dead_items? @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) * We set all LP_DEAD items from the first heap pass to LP_UNUSED during * the second heap pass. No more, no less. */ - Assert(index > 0); Assert(vacrel->num_index_scans > 1 || - (index == vacrel->lpdead_items && + (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && vacuumed_pages == vacrel->lpdead_item_pages)); ereport(DEBUG2, - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, TidStoreNumTids(vacrel->dead_items), + vacuumed_pages))); We assert that vacrel->lpdead_items has the expected value, and then the ereport repeats the function call (with a lock) to read the value we just consulted to pass the assert. If we *really* want to compare counts, maybe we could invent a debugging-only function that iterates over the tree and popcounts the bitmaps. That seems too expensive for regular assert builds, though. 
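For what it's worth, the per-block half of that "iterate + popcount" idea is cheap; the expensive part is walking the whole tree. A sketch of the per-bitmap step in plain C, with no tidstore API assumed:

#include <stdint.h>

/* Count the offsets recorded in one block's offset bitmap. */
static int
count_block_offsets(const uint64_t *words, int nwords)
{
    int     count = 0;

    for (int i = 0; i < nwords; i++)
        count += __builtin_popcountll(words[i]);    /* gcc/clang builtin */
    return count;
}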
On the subject of debugging builds, I think it no longer makes sense to have the array for debug checking in tid store, even during development. A few months ago, we had an encoding scheme that looked simple on paper, but its code was fiendishly difficult to follow (at least for me). That's gone. In addition to the debugging count above, we could also put a copy of the key in the BlockTableEntry's header, in debug builds. We don't yet need to care about the key size, since we don't (yet) have runtime-embeddable values. > Currently (as of v52 patch) RT_FIND is > doing so, [meaning, there is no internal "automatic" locking here since after we switched to variable-length types, an outstanding TODO] Maybe it's okay to expose global locking for v17. I have one possible alternative: This week I tried an idea to use a callback there so that after internal unlocking, the caller received the value (or whatever else needs to happen, such as lookup an offset in the tid bitmap). I've attached a draft for that that passes radix tree tests. It's a bit awkward, but I'm guessing this would more closely match future internal atomic locking. Let me know what you think of the concept, and then do whichever way you think is best. (using v53 as the basis) I believe this is the only open question remaining. The rest is just polish and testing. > During trying this idea, I realized that there is a visibility problem > in the radix tree template If it's broken even without the embedding I'll look into this (I don't know if this configuration has ever been tested). I think a good test is putting the shared tid tree in it's own translation unit, to see if anything needs to be fixed. I'll go try that.
Attachment
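For readers following along, the callback concept in the attached draft would look roughly like this; the names and signature here are guesses for illustration, not the draft's actual API:

/* The caller supplies a callback plus an opaque argument; the tree invokes
 * the callback with the found value, so whatever locking the tree performs
 * around the lookup also covers the caller's use of that value. */
typedef void (*rt_value_callback) (void *value, void *caller_arg);

extern bool rt_find_with_callback(rt_radix_tree *tree, uint64 key,
                                  rt_value_callback cb, void *caller_arg);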
On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > It seems we cannot make this work nicely. IIUC VacDeadItems is > > allocated in DSM and TidStore is embedded there. However, > > dead_items->items.area is a local pointer to dsa_area. So we cannot > > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > > would need to pass dsa_area to each interface by callers. > > Thanks again for exploring this line of thinking! Okay, it seems even > if there's a way to make this work, it would be too invasive to > justify when compared with the advantage I was hoping for. > > > > If we can't make this work nicely, I'd be okay with keeping the tid > > > store control object. My biggest concern is unnecessary > > > double-locking. > > > > If we don't do any locking stuff in radix tree APIs and it's the > > user's responsibility at all, probably we don't need a lock for > > tidstore? That is, we expose lock functions as you mentioned and the > > user (like tidstore) acquires/releases the lock before/after accessing > > the radix tree and num_items. > > I'm not quite sure what the point of "num_items" is anymore, because > it was really tied to the array in VacDeadItems. dead_items->num_items > is essential to reading/writing the array correctly. If this number is > wrong, the array is corrupt. There is no such requirement for the > radix tree. We don't need to know the number of tids to add to it or > do a lookup, or anything. True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking we need to have the number of TIDs in a tidstore, especially in the tidstore's control object. > > There are a number of places where we assert "the running count of the > dead items" is the same as "the length of the dead items array", like > here: > > @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) > BlockNumber threshold; > > Assert(vacrel->num_index_scans == 0); > - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); > + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); > > As such, in HEAD I'm guessing it's arbitrary which one is used for > control flow. Correct me if I'm mistaken. If I am wrong for some part > of the code, it'd be good to understand when that invariant can't be > maintained. > > @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) > * Do index vacuuming (call each index's ambulkdelete routine), then do > * related heap vacuuming > */ > - if (dead_items->num_items > 0) > + if (TidStoreNumTids(dead_items) > 0) > lazy_vacuum(vacrel); > > Like here. In HEAD, could this have used vacrel->dead_items? > > @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > * We set all LP_DEAD items from the first heap pass to LP_UNUSED during > * the second heap pass. No more, no less. 
> */ > - Assert(index > 0); > Assert(vacrel->num_index_scans > 1 || > - (index == vacrel->lpdead_items && > + (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && > vacuumed_pages == vacrel->lpdead_item_pages)); > > ereport(DEBUG2, > - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", > - vacrel->relname, (long long) index, vacuumed_pages))); > + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers > in %u pages", > + vacrel->relname, TidStoreNumTids(vacrel->dead_items), > + vacuumed_pages))); > > We assert that vacrel->lpdead_items has the expected value, and then > the ereport repeats the function call (with a lock) to read the value > we just consulted to pass the assert. > > If we *really* want to compare counts, maybe we could invent a > debugging-only function that iterates over the tree and popcounts the > bitmaps. That seems too expensive for regular assert builds, though. IIUC lpdead_items is the total number of LP_DEAD items vacuumed during the whole lazy vacuum operation whereas num_items is the number of LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. That is, after heap vacuum, the latter counter is reset while the former counter is not. The latter counter is used in lazyvacuum.c as well as the ereport in vac_bulkdel_one_index(). > > On the subject of debugging builds, I think it no longer makes sense > to have the array for debug checking in tid store, even during > development. A few months ago, we had an encoding scheme that looked > simple on paper, but its code was fiendishly difficult to follow (at > least for me). That's gone. In addition to the debugging count above, > we could also put a copy of the key in the BlockTableEntry's header, > in debug builds. We don't yet need to care about the key size, since > we don't (yet) have runtime-embeddable values. Putting a copy of the key in BlocktableEntry's header is an interesting idea. But the current debug code in the tidstore also makes sure that the tidstore returns TIDs in the correct order during an iterate operation. I think it still has a value and you can disable it by removing the "#define TIDSTORE_DEBUG" line. > > > Currently (as of v52 patch) RT_FIND is > > doing so, > > [meaning, there is no internal "automatic" locking here since after we > switched to variable-length types, an outstanding TODO] > Maybe it's okay to expose global locking for v17. I have one possible > alternative: > > This week I tried an idea to use a callback there so that after > internal unlocking, the caller received the value (or whatever else > needs to happen, such as lookup an offset in the tid bitmap). I've > attached a draft for that that passes radix tree tests. It's a bit > awkward, but I'm guessing this would more closely match future > internal atomic locking. Let me know what you think of the concept, > and then do whichever way you think is best. (using v53 as the basis) Thank you for verifying this idea! Interesting. While it's promising in terms of future atomic locking, I'm concerned it might not be easy to use if radix tree APIs supports only such callback style. I believe the caller would like to pass one more data along with val_data. For example, considering tidstore that has num_tids internally, it wants to pass both a pointer to BlocktableEntry and a pointer to TidStore itself so that it increments the counter while holding a lock. Another API idea for future atomic locking is to separate RT_SET()/RT_FIND() into begin and end. 
In the RT_SET_BEGIN() API, we find the key, extend nodes if necessary, set the value, and return the result while holding the lock. For example, if the radix tree supports lock coupling, the leaf node and its parent remain locked. Then the caller does its job and calls RT_SET_END(), which does cleanup work such as releasing locks. I've not fully considered this approach, but even this idea seems complex and not easy to use. I prefer the current simple approach, as we support the simple locking mechanism for now. > > I believe this is the only open question remaining. The rest is just > polish and testing. Right. > > > During trying this idea, I realized that there is a visibility problem > > in the radix tree template > > If it's broken even without the embedding I'll look into this (I don't > know if this configuration has ever been tested). I think a good test > is putting the shared tid tree in it's own translation unit, to see if > anything needs to be fixed. I'll go try that. Thanks. BTW, in radixtree.h pg_attribute_unused() is used for some functions; is it for debugging purposes? I don't see why it's used only for some functions. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
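A sketch of what that begin/end split might look like; the names and signatures are assumptions:

/* rt_set_begin() locates or creates the entry for the key and returns it
 * with the relevant node(s) still locked (e.g. under lock coupling);
 * rt_set_end() releases whatever locks rt_set_begin() left held. */
extern RT_VALUE_TYPE *rt_set_begin(rt_radix_tree *tree, uint64 key, bool *found);
extern void rt_set_end(rt_radix_tree *tree);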
On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I'm not quite sure what the point of "num_items" is anymore, because > > it was really tied to the array in VacDeadItems. dead_items->num_items > > is essential to reading/writing the array correctly. If this number is > > wrong, the array is corrupt. There is no such requirement for the > > radix tree. We don't need to know the number of tids to add to it or > > do a lookup, or anything. > > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking > we need to have the number of TIDs in a tidstore, especially in the > tidstore's control object. Hmm, it would be kind of sad to require explicit locking in tidstore.c is only for maintaining that one number at all times. Aside from the two ereports after an index scan / second heap pass, the only non-assert place where it's used is @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); ...and that condition can be checked by doing a single step of iteration to see if it shows anything. But for the ereport, my idea for iteration + popcount is probably quite slow. > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during > the whole lazy vacuum operation whereas num_items is the number of > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. > That is, after heap vacuum, the latter counter is reset while the > former counter is not. > > The latter counter is used in lazyvacuum.c as well as the ereport in > vac_bulkdel_one_index(). Ah, of course. > Putting a copy of the key in BlocktableEntry's header is an > interesting idea. But the current debug code in the tidstore also > makes sure that the tidstore returns TIDs in the correct order during > an iterate operation. I think it still has a value and you can disable > it by removing the "#define TIDSTORE_DEBUG" line. Fair enough. I just thought it'd be less work to leave this out in case we change how locking is called. > > This week I tried an idea to use a callback there so that after > > internal unlocking, the caller received the value (or whatever else > > needs to happen, such as lookup an offset in the tid bitmap). I've > > attached a draft for that that passes radix tree tests. It's a bit > > awkward, but I'm guessing this would more closely match future > > internal atomic locking. Let me know what you think of the concept, > > and then do whichever way you think is best. (using v53 as the basis) > > Thank you for verifying this idea! Interesting. While it's promising > in terms of future atomic locking, I'm concerned it might not be easy > to use if radix tree APIs supports only such callback style. Yeah, it's quite awkward. It could be helped by only exposing it for varlen types. For simply returning "present or not" (used a lot in the regression tests), we could skip the callback if the data is null. That is all also extra stuff. > I believe > the caller would like to pass one more data along with val_data. For That's trivial, however, if I understand you correctly. With "void *", a callback can receive anything, including a struct containing additional pointers to elsewhere. 
> example, considering tidstore that has num_tids internally, it wants > to pass both a pointer to BlocktableEntry and a pointer to TidStore > itself so that it increments the counter while holding a lock. Hmm, so a callback to RT_SET also. That's interesting! Anyway, I agree it needs to be simple, since the first use doesn't even have multiple writers. > BTW in radixtree.h pg_attribute_unused() is used for some functions, > but is it for debugging purposes? I don't see why it's used only for > some functions. It was there to silence warnings about unused functions. I only see one remaining, and it's already behind a debug symbol, so we might not need this attribute anymore.
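To illustrate packing extra context through the void pointer, a minimal sketch with hypothetical names:

/* Everything the callback needs travels in one struct passed as the
 * opaque argument. */
typedef struct SetExtraArg
{
    TidStore   *ts;         /* so the callback can bump the counter */
    int         noffsets;   /* offsets just added for this block */
} SetExtraArg;

static void
on_set(void *value, void *arg)
{
    SetExtraArg *extra = (SetExtraArg *) arg;

    /* runs while the tree still holds its lock */
    extra->ts->num_tids += extra->noffsets;
}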
I wrote: > On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > During trying this idea, I realized that there is a visibility problem > > in the radix tree template > > If it's broken even without the embedding I'll look into this (I don't > know if this configuration has ever been tested). I think a good test > is putting the shared tid tree in it's own translation unit, to see if > anything needs to be fixed. I'll go try that. Here's a quick test that this works. The only thing that really needed fixing in the template was failure to un-define one symbol. The rest was just moving some things around.
Attachment
On Fri, Jan 19, 2024 at 6:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > I'm not quite sure what the point of "num_items" is anymore, because > > > it was really tied to the array in VacDeadItems. dead_items->num_items > > > is essential to reading/writing the array correctly. If this number is > > > wrong, the array is corrupt. There is no such requirement for the > > > radix tree. We don't need to know the number of tids to add to it or > > > do a lookup, or anything. > > > > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking > > we need to have the number of TIDs in a tidstore, especially in the > > tidstore's control object. > > Hmm, it would be kind of sad to require explicit locking in tidstore.c > is only for maintaining that one number at all times. Aside from the > two ereports after an index scan / second heap pass, the only > non-assert place where it's used is > > @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) > * Do index vacuuming (call each index's ambulkdelete routine), then do > * related heap vacuuming > */ > - if (dead_items->num_items > 0) > + if (TidStoreNumTids(dead_items) > 0) > lazy_vacuum(vacrel); > > ...and that condition can be checked by doing a single step of > iteration to see if it shows anything. But for the ereport, my idea > for iteration + popcount is probably quite slow. Right. On further thought, as you pointed out before, "num_tids" should not be in tidstore in terms of integration with tidbitmap.c, because tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no longer accurate and useful. Similarly, looking at tidbitmap.c, it has npages and nchunks but they will not be necessary in lazy vacuum use case. Also, assuming that we support parallel heap pruning, probably we need to somehow lock the tidstore while adding tids to the tidstore concurrently by parallel vacuum worker. But in tidbitmap use case, we don't need to lock the tidstore since it doesn't have multiple writers. Given these facts, different statistics and different lock strategies are required by different use case. So I think there are 3 options: 1. expose lock functions for tidstore and the caller manages the statistics in the outside of tidstore. For example, in lazyvacuum.c we would have a TidStore for tid storage as well as VacDeadItemsInfo that has num_tids and max_bytes. Both are in LVRelState. For parallel vacuum, we pass both to the workers via DSM and pass both to function where the statistics are required. As for the exposed lock functions, when adding tids to the tidstore, the caller would need to call something like TidStoreLockExclusive(ts) that further calls LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. 2. add callback functions to tidstore so that the caller can do its work while holding a lock on the tidstore. This is like the idea we just discussed for radix tree. The caller passes a callback function and user data to TidStoreSetBlockOffsets(), and the callback is called after setting tids. Similar to option 1, the statistics need to be stored in a different area. 3. keep tidstore.c and tidbitmap.c separate implementations but use radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its control object and doesn't have any lossy page support. On the other hand, in tidbitmap.c we replace simplehash with radix tree. 
This makes tidstore.c simple but we would end up having different data structures for similar usage. I think it's worth trying option 1. What do you think, John? > > > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during > > the whole lazy vacuum operation whereas num_items is the number of > > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. > > That is, after heap vacuum, the latter counter is reset while the > > former counter is not. > > > > The latter counter is used in lazyvacuum.c as well as the ereport in > > vac_bulkdel_one_index(). > > Ah, of course. > > > Putting a copy of the key in BlocktableEntry's header is an > > interesting idea. But the current debug code in the tidstore also > > makes sure that the tidstore returns TIDs in the correct order during > > an iterate operation. I think it still has a value and you can disable > > it by removing the "#define TIDSTORE_DEBUG" line. > > Fair enough. I just thought it'd be less work to leave this out in > case we change how locking is called. > > > > This week I tried an idea to use a callback there so that after > > > internal unlocking, the caller received the value (or whatever else > > > needs to happen, such as lookup an offset in the tid bitmap). I've > > > attached a draft for that that passes radix tree tests. It's a bit > > > awkward, but I'm guessing this would more closely match future > > > internal atomic locking. Let me know what you think of the concept, > > > and then do whichever way you think is best. (using v53 as the basis) > > > > Thank you for verifying this idea! Interesting. While it's promising > > in terms of future atomic locking, I'm concerned it might not be easy > > to use if radix tree APIs supports only such callback style. > > Yeah, it's quite awkward. It could be helped by only exposing it for > varlen types. For simply returning "present or not" (used a lot in the > regression tests), we could skip the callback if the data is null. > That is all also extra stuff. > > > I believe > > the caller would like to pass one more data along with val_data. For > > That's trivial, however, if I understand you correctly. With "void *", > a callback can receive anything, including a struct containing > additional pointers to elsewhere. > > > example, considering tidstore that has num_tids internally, it wants > > to pass both a pointer to BlocktableEntry and a pointer to TidStore > > itself so that it increments the counter while holding a lock. > > Hmm, so a callback to RT_SET also. That's interesting! > > Anyway, I agree it needs to be simple, since the first use doesn't > even have multiple writers. Right. > > > BTW in radixtree.h pg_attribute_unused() is used for some functions, > > but is it for debugging purposes? I don't see why it's used only for > > some functions. > > It was there to silence warnings about unused functions. I only see > one remaining, and it's already behind a debug symbol, so we might not > need this attribute anymore. Okay. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On further thought, as you pointed out before, "num_tids" should not > be in tidstore in terms of integration with tidbitmap.c, because > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no > longer accurate and useful. Similarly, looking at tidbitmap.c, it has > npages and nchunks but they will not be necessary in lazy vacuum use > case. Also, assuming that we support parallel heap pruning, probably > we need to somehow lock the tidstore while adding tids to the tidstore > concurrently by parallel vacuum worker. But in tidbitmap use case, we > don't need to lock the tidstore since it doesn't have multiple > writers. Not currently, and it does seem bad to require locking where it's not required. (That would be a prerequisite for parallel index scan. It's been tried before with the hash table, but concurrency didn't scale well with the hash table. I have no reason to think that the radix tree would scale significantly better with the same global LW lock, but as you know there are other locking schemes possible.) > Given these facts, different statistics and different lock > strategies are required by different use case. So I think there are 3 > options: > > 1. expose lock functions for tidstore and the caller manages the > statistics in the outside of tidstore. For example, in lazyvacuum.c we > would have a TidStore for tid storage as well as VacDeadItemsInfo that > has num_tids and max_bytes. Both are in LVRelState. For parallel > vacuum, we pass both to the workers via DSM and pass both to function > where the statistics are required. As for the exposed lock functions, > when adding tids to the tidstore, the caller would need to call > something like TidStoreLockExclusive(ts) that further calls > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. The advantage here is that vacuum can avoid locking entirely while using shared memory, just like it does now, and has the option to add it later. IIUC, the radix tree struct would have a lock member, but wouldn't take any locks internally? Maybe we still need one for RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own DSA_AREA_LOCK -- maybe that's enough? That seems simplest, and is not very far from what we do now. If we do this, then the lock functions should be where we branch for is_shared. > 2. add callback functions to tidstore so that the caller can do its > work while holding a lock on the tidstore. This is like the idea we > just discussed for radix tree. The caller passes a callback function > and user data to TidStoreSetBlockOffsets(), and the callback is called > after setting tids. Similar to option 1, the statistics need to be > stored in a different area. I think we'll have to move to something like this eventually, but it seems like overkill right now. > 3. keep tidstore.c and tidbitmap.c separate implementations but use > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its > control object and doesn't have any lossy page support. On the other > hand, in tidbitmap.c we replace simplehash with radix tree. This makes > tidstore.c simple but we would end up having different data structures > for similar usage. They have so much in common that it's worth it to use the same interface and (eventually) value type. They just need separate paths for adding tids, as we've discussed. > I think it's worth trying option 1. What do you think, John? +1
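A sketch of that branch-on-is_shared placement, borrowing the ts->tree.shared->ctl.lock spelling from upthread; TidStoreIsShared() and the field layout are assumptions:

static inline void
TidStoreLockExclusive(TidStore *ts)
{
    /* a local tree has a single user and needs no locking */
    if (TidStoreIsShared(ts))
        LWLockAcquire(&ts->tree.shared->ctl.lock, LW_EXCLUSIVE);
}

static inline void
TidStoreUnlock(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockRelease(&ts->tree.shared->ctl.lock);
}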
On Wed, Jan 17, 2024 at 12:32 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I wrote: > > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > If this is a possibility, I thought I'd first send the last (I hope) > large-ish set of radix tree cleanups to avoid rebasing issues. I'm not > including tidstore/vacuum here, because recent discussion has some > up-in-the-air work. Thank you for updating the patches! These updates look good to me. > > Should be self-explanatory, but some thing are worth calling out: > 0012 and 0013: Some time ago I started passing insertpos as a > parameter, but now see that is not ideal -- when growing from node16 > to node48 we don't need it at all, so it's a wasted calculation. While > reverting that, I found that this also allows passing constants in > some cases. > 0014 makes a cleaner separation between adding a child and growing a > node, resulting in more compact-looking functions. > 0019 is a bit unpolished, but I realized that it's pointless to assign > a zero child when further up the call stack we overwrite it anyway > with the actual value. With this, that assignment is skipped. This > makes some comments and names strange, so needs a bit of polish, but > wanted to get it out there anyway. Cool. I'll merge these patches in the next version v54 patch set. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jan 22, 2024 at 2:36 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On further thought, as you pointed out before, "num_tids" should not > > be in tidstore in terms of integration with tidbitmap.c, because > > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no > > longer accurate and useful. Similarly, looking at tidbitmap.c, it has > > npages and nchunks but they will not be necessary in lazy vacuum use > > case. Also, assuming that we support parallel heap pruning, probably > > we need to somehow lock the tidstore while adding tids to the tidstore > > concurrently by parallel vacuum worker. But in tidbitmap use case, we > > don't need to lock the tidstore since it doesn't have multiple > > writers. > > Not currently, and it does seem bad to require locking where it's not required. > > (That would be a prerequisite for parallel index scan. It's been tried > before with the hash table, but concurrency didn't scale well with the > hash table. I have no reason to think that the radix tree would scale > significantly better with the same global LW lock, but as you know > there are other locking schemes possible.) > > > Given these facts, different statistics and different lock > > strategies are required by different use case. So I think there are 3 > > options: > > > > 1. expose lock functions for tidstore and the caller manages the > > statistics in the outside of tidstore. For example, in lazyvacuum.c we > > would have a TidStore for tid storage as well as VacDeadItemsInfo that > > has num_tids and max_bytes. Both are in LVRelState. For parallel > > vacuum, we pass both to the workers via DSM and pass both to function > > where the statistics are required. As for the exposed lock functions, > > when adding tids to the tidstore, the caller would need to call > > something like TidStoreLockExclusive(ts) that further calls > > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. > > The advantage here is that vacuum can avoid locking entirely while > using shared memory, just like it does now, and has the option to add > it later. True. > IIUC, the radix tree struct would have a lock member, but wouldn't > take any locks internally? Maybe we still need one for > RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own > DSA_AREA_LOCK -- maybe that's enough? I think that's a good point. So there will be no place where the radix tree takes any locks internally. > > That seems simplest, and is not very far from what we do now. If we do > this, then the lock functions should be where we branch for is_shared. Agreed. > > > 2. add callback functions to tidstore so that the caller can do its > > work while holding a lock on the tidstore. This is like the idea we > > just discussed for radix tree. The caller passes a callback function > > and user data to TidStoreSetBlockOffsets(), and the callback is called > > after setting tids. Similar to option 1, the statistics need to be > > stored in a different area. > > I think we'll have to move to something like this eventually, but it > seems like overkill right now. Right. > > > 3. keep tidstore.c and tidbitmap.c separate implementations but use > > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its > > control object and doesn't have any lossy page support. On the other > > hand, in tidbitmap.c we replace simplehash with radix tree. 
This makes > > tidstore.c simple but we would end up having different data structures > > for similar usage. > > They have so much in common that it's worth it to use the same > interface and (eventually) value type. They just need separate paths > for adding tids, as we've discussed. Agreed. > > > I think it's worth trying option 1. What do you think, John? > > +1 Thanks! Before working on this idea, since the latest patches conflict with the current HEAD, I share the latest patch set (v54). Here is the summary: - As for radix tree part, it's based on v53 patch. I've squashed most of cleanups and changes in v53 except for "DRAFT: Stop using invalid pointers as placeholders." as I thought you might want to still work on it. BTW it includes "#undef RT_SHMEM". - As for tidstore, it's based on v51. That is, it still has the control object and num_tids there. - As for vacuum integration, it's also based on v51. But we no longer need to change has_lpdead_items and LVPagePruneState thanks to the recent commit c120550edb8 and e313a61137. For the next version patch, I'll work on this idea and try to clean up locking stuff both in tidstore and radix tree. Or if you're already working on some of them, please let me know. I'll review it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > For the next version patch, I'll work on this idea and try to clean up > locking stuff both in tidstore and radix tree. Or if you're already > working on some of them, please let me know. I'll review it. Okay go ahead, sounds good. I plan to look at the tests since they haven't been looked at in a while.
On Mon, Jan 22, 2024 at 5:18 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > For the next version patch, I'll work on this idea and try to clean up > > locking stuff both in tidstore and radix tree. Or if you're already > > working on some of them, please let me know. I'll review it. > > Okay go ahead, sounds good. I plan to look at the tests since they > haven't been looked at in a while. I've attached the latest patch set. Here are updates from v54 patch: 0005 - Expose radix tree lock functions and remove all locks taken internally in radixtree.h. 0008 - Remove tidstore's control object. 0009 - Add tidstore lock functions. 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items" separate from TidStore. Also make lazy vacuum and parallel vacuum use it. The new patches probably need to be polished but the VacDeadItemInfo idea looks good to me. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
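For readers skimming the patch list, 0011 amounts to something like the struct below; the field names are inferred from the discussion, not copied from the patch:

/* Statistics kept beside the TidStore rather than inside it, so the store
 * itself needs no locking just to maintain a counter. Lives in DSM for
 * parallel vacuum so workers see the same limit and count. */
typedef struct VacDeadItemsInfo
{
    size_t  max_bytes;  /* limit derived from maintenance_work_mem et al. */
    int64   num_items;  /* TIDs collected in the current index-vacuum cycle */
} VacDeadItemsInfo;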
Attachment
On Tue, Jan 23, 2024 at 12:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 5:18 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > For the next version patch, I'll work on this idea and try to clean up > > > locking stuff both in tidstore and radix tree. Or if you're already > > > working on some of them, please let me know. I'll review it. > > > > Okay go ahead, sounds good. I plan to look at the tests since they > > haven't been looked at in a while. > > I've attached the latest patch set. Here are updates from v54 patch: > > 0005 - Expose radix tree lock functions and remove all locks taken > internally in radixtree.h. > 0008 - Remove tidstore's control object. > 0009 - Add tidstore lock functions. > 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items" > separate from TidStore. Also make lazy vacuum and parallel vacuum use > it. John pointed out offlist the tarball includes only the patches up to 0009. I've attached the correct one. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The new patches probably need to be polished but the VacDeadItemInfo > idea looks good to me. That idea looks good to me, too. Since you already likely know what you'd like to polish, I don't have much to say except for a few questions below. I also did a quick sweep through every patch, so some of these comments are unrelated to recent changes: v55-0003: +size_t +dsa_get_total_size(dsa_area *area) +{ + size_t size; + + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED); + size = area->control->total_segment_size; + LWLockRelease(DSA_AREA_LOCK(area)); I looked and found dsa.c doesn't already use shared locks in HEAD, even dsa_dump. Not sure why that is... +/* + * Calculate the slab blocksize so that we can allocate at least 32 chunks + * from the block. + */ +#define RT_SLAB_BLOCK_SIZE(size) \ + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) The first parameter seems to be trying to make the block size exact, but that's not right, because of the chunk header, and maybe alignment. If the default block size is big enough to waste only a tiny amount of space, let's just use that as-is. Also, I think all block sizes in the code base have been a power of two, but I'm not sure how much that matters. +#ifdef RT_SHMEM + fprintf(stderr, " [%d] chunk %x slot " DSA_POINTER_FORMAT "\n", + i, n4->chunks[i], n4->children[i]); +#else + fprintf(stderr, " [%d] chunk %x slot %p\n", + i, n4->chunks[i], n4->children[i]); +#endif Maybe we could invent a child pointer format, so we only #ifdef in one place. --- /dev/null +++ b/src/test/modules/test_radixtree/meson.build @@ -0,0 +1,35 @@ +# FIXME: prevent install during main install, but not during test :/ Can you look into this? test_radixtree.c: +/* + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO? + */ +static int rt_node_class_fanouts[] = { + 4, /* RT_CLASS_3 */ + 15, /* RT_CLASS_32_MIN */ + 32, /* RT_CLASS_32_MAX */ + 125, /* RT_CLASS_125 */ + 256 /* RT_CLASS_256 */ +}; These numbers have been wrong a long time, too, but only matters for figuring out where it went wrong when something is broken. And for the XXX, instead of trying to use the largest number that should fit (it's obviously not testing that the expected node can actually hold that number anyway), it seems we can just use a "big enough" number to cause growing into the desired size class. As far as cleaning up the tests, I always wondered why these didn't use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where where convenient, and leave comments above the tests. That seemed like a good idea to me -- was there a reason to have hand-written branches and elog messages everywhere? --- a/src/tools/pginclude/cpluspluscheck +++ b/src/tools/pginclude/cpluspluscheck @@ -101,6 +101,12 @@ do test "$f" = src/include/nodes/nodetags.h && continue test "$f" = src/backend/nodes/nodetags.h && continue + # radixtree_*_impl.h cannot be included standalone: they are just code fragments. + test "$f" = src/include/lib/radixtree_delete_impl.h && continue + test "$f" = src/include/lib/radixtree_insert_impl.h && continue + test "$f" = src/include/lib/radixtree_iter_impl.h && continue + test "$f" = src/include/lib/radixtree_search_impl.h && continue Ha! I'd forgotten about these -- they're long outdated. v55-0005: - * The radix tree is locked in shared mode during the iteration, so - * RT_END_ITERATE needs to be called when finished to release the lock. 
+ * The caller needs to acquire a lock in shared mode during the iteration + * if necessary. "need if necessary" is maybe better phrased as "is the caller's responsibility" + /* + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory. + */ total = dsa_get_total_size(tree->dsa); Maybe better to have a header comment for RT_MEMORY_USAGE that the caller doesn't need to take a lock. v55-0006: "WIP: Not built, since some benchmarks have broken" -- I'll work on this when I re-run some benchmarks. v55-0007: + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, + * and stored in the radix tree. This hasn't been true for a few months now, and I thought we fixed this in some earlier version? + * TODO: The caller must be certain that no other backend will attempt to + * access the TidStore before calling this function. Other backend must + * explicitly call TidStoreDetach() to free up backend-local memory associated + * with the TidStore. The backend that calls TidStoreDestroy() must not call + * TidStoreDetach(). Do we need to do anything now? v55-0008: -TidStoreAttach(dsa_area *area, TidStoreHandle handle) +TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) "handle" seemed like a fine name. Is that not the case anymore? The new one is kind of cryptic. The commit message just says "remove control object" -- does that imply that we need to think of this parameter differently, or is it unrelated? (Same with dead_items_handle in 0011) v55-0011: + /* + * Recreate the tidstore with the same max_bytes limitation. We cannot + * use neither maintenance_work_mem nor autovacuum_work_mem as they could + * already be changed. + */ I don't understand this part.
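On the child pointer format suggestion, something along these lines would confine the #ifdef to one place; the macro name is made up:

#ifdef RT_SHMEM
#define RT_CHILD_PTR_FORMAT DSA_POINTER_FORMAT
#else
#define RT_CHILD_PTR_FORMAT "%p"
#endif

    /* string-literal concatenation keeps a single fprintf call site */
    fprintf(stderr, " [%d] chunk %x slot " RT_CHILD_PTR_FORMAT "\n",
            i, n4->chunks[i], n4->children[i]);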
On Wed, Jan 24, 2024 at 3:42 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > The new patches probably need to be polished but the VacDeadItemInfo > > idea looks good to me. > > That idea looks good to me, too. Since you already likely know what > you'd like to polish, I don't have much to say except for a few > questions below. I also did a quick sweep through every patch, so some > of these comments are unrelated to recent changes: Thank you! > > v55-0003: > > +size_t > +dsa_get_total_size(dsa_area *area) > +{ > + size_t size; > + > + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED); > + size = area->control->total_segment_size; > + LWLockRelease(DSA_AREA_LOCK(area)); > > I looked and found dsa.c doesn't already use shared locks in HEAD, > even dsa_dump. Not sure why that is... Oh, the dsa_dump part seems to be a bug. But it'll keep it consistent with others. > > +/* > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > + * from the block. > + */ > +#define RT_SLAB_BLOCK_SIZE(size) \ > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > The first parameter seems to be trying to make the block size exact, > but that's not right, because of the chunk header, and maybe > alignment. If the default block size is big enough to waste only a > tiny amount of space, let's just use that as-is. Agreed. > Also, I think all > block sizes in the code base have been a power of two, but I'm not > sure how much that matters. Did you mean all slab block sizes we use in radixtree.h? > > +#ifdef RT_SHMEM > + fprintf(stderr, " [%d] chunk %x slot " DSA_POINTER_FORMAT "\n", > + i, n4->chunks[i], n4->children[i]); > +#else > + fprintf(stderr, " [%d] chunk %x slot %p\n", > + i, n4->chunks[i], n4->children[i]); > +#endif > > Maybe we could invent a child pointer format, so we only #ifdef in one place. WIll change. > > --- /dev/null > +++ b/src/test/modules/test_radixtree/meson.build > @@ -0,0 +1,35 @@ > +# FIXME: prevent install during main install, but not during test :/ > > Can you look into this? Okay, I'll look at it. > > test_radixtree.c: > > +/* > + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO? > + */ > +static int rt_node_class_fanouts[] = { > + 4, /* RT_CLASS_3 */ > + 15, /* RT_CLASS_32_MIN */ > + 32, /* RT_CLASS_32_MAX */ > + 125, /* RT_CLASS_125 */ > + 256 /* RT_CLASS_256 */ > +}; > > These numbers have been wrong a long time, too, but only matters for > figuring out where it went wrong when something is broken. And for the > XXX, instead of trying to use the largest number that should fit (it's > obviously not testing that the expected node can actually hold that > number anyway), it seems we can just use a "big enough" number to > cause growing into the desired size class. > > As far as cleaning up the tests, I always wondered why these didn't > use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where > where convenient, and leave comments above the tests. That seemed like > a good idea to me -- was there a reason to have hand-written branches > and elog messages everywhere? The current test is based on test_integerset. I agree that we can improve it by using EXPECT_TRUE etc. 
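For reference, the EXPECT_TRUE style mentioned above is roughly the following, modeled on Andres's prototype; the exact macro wording is an assumption:

#define EXPECT_TRUE(expr) \
    do { \
        if (!(expr)) \
            elog(ERROR, \
                 "%s was unexpectedly false in file \"%s\" line %u", \
                 #expr, __FILE__, __LINE__); \
    } while (0)

/* usage: EXPECT_TRUE(found); instead of a hand-written branch + elog */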
> > --- a/src/tools/pginclude/cpluspluscheck > +++ b/src/tools/pginclude/cpluspluscheck > @@ -101,6 +101,12 @@ do > test "$f" = src/include/nodes/nodetags.h && continue > test "$f" = src/backend/nodes/nodetags.h && continue > > + # radixtree_*_impl.h cannot be included standalone: they are just > code fragments. > + test "$f" = src/include/lib/radixtree_delete_impl.h && continue > + test "$f" = src/include/lib/radixtree_insert_impl.h && continue > + test "$f" = src/include/lib/radixtree_iter_impl.h && continue > + test "$f" = src/include/lib/radixtree_search_impl.h && continue > > Ha! I'd forgotten about these -- they're long outdated. Will remove. > > v55-0005: > > - * The radix tree is locked in shared mode during the iteration, so > - * RT_END_ITERATE needs to be called when finished to release the lock. > + * The caller needs to acquire a lock in shared mode during the iteration > + * if necessary. > > "need if necessary" is maybe better phrased as "is the caller's responsibility" Will fix. > > + /* > + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory. > + */ > total = dsa_get_total_size(tree->dsa); > > Maybe better to have a header comment for RT_MEMORY_USAGE that the > caller doesn't need to take a lock. Will fix. > > v55-0006: > > "WIP: Not built, since some benchmarks have broken" -- I'll work on > this when I re-run some benchmarks. Thanks! > > v55-0007: > > + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, > + * and stored in the radix tree. > > This hasn't been true for a few months now, and I thought we fixed > this in some earlier version? Yeah, I'll fix it. > > + * TODO: The caller must be certain that no other backend will attempt to > + * access the TidStore before calling this function. Other backend must > + * explicitly call TidStoreDetach() to free up backend-local memory associated > + * with the TidStore. The backend that calls TidStoreDestroy() must not call > + * TidStoreDetach(). > > Do we need to do anything now? No, will remove it. > > v55-0008: > > -TidStoreAttach(dsa_area *area, TidStoreHandle handle) > +TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) > > "handle" seemed like a fine name. Is that not the case anymore? The > new one is kind of cryptic. The commit message just says "remove > control object" -- does that imply that we need to think of this > parameter differently, or is it unrelated? (Same with > dead_items_handle in 0011) Since it's actually just a radix tree's handle it was kind of unnatural to me to use the same dsa_pointer as different handles. But rethinking it, I agree "handle" is a fine name. > > v55-0011: > > + /* > + * Recreate the tidstore with the same max_bytes limitation. We cannot > + * use neither maintenance_work_mem nor autovacuum_work_mem as they could > + * already be changed. > + */ > > I don't understand this part. I wanted to mean that if maintenance_work_mem is changed and the config file is reloaded, its value could no longer be the same as the one that we used when initializing the parallel vacuum. That's why we need to store max_bytes in the DSM. I'll rephrase it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Jan 26, 2024 at 11:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 3:42 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > The new patches probably need to be polished but the VacDeadItemInfo > > > idea looks good to me. > > > > That idea looks good to me, too. Since you already likely know what > > you'd like to polish, I don't have much to say except for a few > > questions below. I also did a quick sweep through every patch, so some > > of these comments are unrelated to recent changes: > > Thank you! > > > > > +/* > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > > + * from the block. > > + */ > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > The first parameter seems to be trying to make the block size exact, > > but that's not right, because of the chunk header, and maybe > > alignment. If the default block size is big enough to waste only a > > tiny amount of space, let's just use that as-is. > > Agreed. > As of v55 patch, the sizes of each node class are: - node4: 40 bytes - node16_lo: 168 bytes - node16_hi: 296 bytes - node48: 784 bytes - node256: 2088 bytes If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste (approximately): - node4: 32 bytes - node16_lo: 128 bytes - node16_hi: 200 bytes - node48: 352 bytes - node256: 1928 bytes We might want to calculate a better slab block size for node256 at least. > > > > + * TODO: The caller must be certain that no other backend will attempt to > > + * access the TidStore before calling this function. Other backend must > > + * explicitly call TidStoreDetach() to free up backend-local memory associated > > + * with the TidStore. The backend that calls TidStoreDestroy() must not call > > + * TidStoreDetach(). > > > > Do we need to do anything now? > > No, will remove it. > I misunderstood something. I think the above statement is still true but we don't need to do anything at this stage. It's a typical usage that the leader destroys the shared data after confirming all workers are detached. It's not a TODO but probably a NOTE. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
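For reference, the waste figures above follow from simple arithmetic: with an 8kB slab block, the unusable tail per block is roughly blocksize % chunksize, ignoring per-chunk and per-block headers (which is why the numbers are approximate). A standalone C sketch that reproduces them:

#include <stdio.h>

int
main(void)
{
    const int   blocksize = 8192;   /* SLAB_DEFAULT_BLOCK_SIZE */
    const int   sizes[] = {40, 168, 296, 784, 2088};
    const char *names[] = {"node4", "node16_lo", "node16_hi", "node48", "node256"};

    for (int i = 0; i < 5; i++)
        printf("%-10s %4d chunks/block, %4d bytes wasted\n",
               names[i], blocksize / sizes[i], blocksize % sizes[i]);
    return 0;
}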
On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > +/* > > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > > > + * from the block. > > > + */ > > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > > > The first parameter seems to be trying to make the block size exact, > > > but that's not right, because of the chunk header, and maybe > > > alignment. If the default block size is big enough to waste only a > > > tiny amount of space, let's just use that as-is. > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste > [snip] > We might want to calculate a better slab block size for node256 at least. I meant the macro could probably be Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) (Right now N=32). I also realize I didn't answer your question earlier about block sizes being powers of two. I was talking about PG in general -- I was thinking all block sizes were powers of two. If that's true, I'm not sure if it's because programmers find the macro calculations easy to reason about, or if there was an implementation reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or just above a power of two, so if we did round that up it would be 128kB. > > > + * TODO: The caller must be certain that no other backend will attempt to > > > + * access the TidStore before calling this function. Other backend must > > > + * explicitly call TidStoreDetach() to free up backend-local memory associated > > > + * with the TidStore. The backend that calls TidStoreDestroy() must not call > > > + * TidStoreDetach(). > > > > > > Do we need to do anything now? > > > > No, will remove it. > > > > I misunderstood something. I think the above statement is still true > but we don't need to do anything at this stage. It's a typical usage > that the leader destroys the shared data after confirming all workers > are detached. It's not a TODO but probably a NOTE. Okay.
On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > +/* > > > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > > > > + * from the block. > > > > + */ > > > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > > > > > The first parameter seems to be trying to make the block size exact, > > > > but that's not right, because of the chunk header, and maybe > > > > alignment. If the default block size is big enough to waste only a > > > > tiny amount of space, let's just use that as-is. > > > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste > > [snip] > > We might want to calculate a better slab block size for node256 at least. > > I meant the macro could probably be > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > (Right now N=32). I also realize I didn't answer your question earlier > about block sizes being powers of two. I was talking about PG in > general -- I was thinking all block sizes were powers of two. If > that's true, I'm not sure if it's because programmers find the macro > calculations easy to reason about, or if there was an implementation > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > just above a power of two, so if we did round that up it would be > 128kB. Thank you for your explanation. It might be better to follow other codes. Does the calculation below make sense to you? RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; while (inner_blocksize < 32 * size_class.allocsize) inner_blocksize <<= 1; As for the lock mode in dsa.c, I've posted a question[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoALgrU2sGWzgq%2B6G9X0ynqyVOjMR5_k4HgsGRWae1j%3DwQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I meant the macro could probably be > > > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > > > (Right now N=32). I also realize I didn't answer your question earlier > > about block sizes being powers of two. I was talking about PG in > > general -- I was thinking all block sizes were powers of two. If > > that's true, I'm not sure if it's because programmers find the macro > > calculations easy to reason about, or if there was an implementation > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > > just above a power of two, so if we did round that up it would be > > 128kB. > > Thank you for your explanation. It might be better to follow other > codes. Does the calculation below make sense to you? > > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; > while (inner_blocksize < 32 * size_class.allocsize) > inner_blocksize <<= 1; It does make sense, but we can do it more simply: Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))
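Spelled out against the existing macro, the suggestion above would look roughly like this (a sketch; pg_nextpower2_32() is the existing helper in src/include/port/pg_bitutils.h, while Max() and SLAB_DEFAULT_BLOCK_SIZE come from c.h and utils/memutils.h):

/*
 * Use the default slab block size, rounding up to the next power of two
 * whenever that is not enough to hold at least 32 chunks.
 */
#define RT_SLAB_BLOCK_SIZE(size) \
    Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32((size) * 32))

For the 2088-byte node256 class this yields 128kB, matching the rounding discussed above.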
On Tue, Jan 30, 2024 at 7:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jan 29, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > I meant the macro could probably be > > > > > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > > > > > (Right now N=32). I also realize I didn't answer your question earlier > > > about block sizes being powers of two. I was talking about PG in > > > general -- I was thinking all block sizes were powers of two. If > > > that's true, I'm not sure if it's because programmers find the macro > > > calculations easy to reason about, or if there was an implementation > > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > > > just above a power of two, so if we did round that up it would be > > > 128kB. > > > > Thank you for your explanation. It might be better to follow other > > codes. Does the calculation below make sense to you? > > > > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; > > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; > > while (inner_blocksize < 32 * size_class.allocsize) > > inner_blocksize <<= 1; > > It does make sense, but we can do it more simply: > > Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32)) Thanks! I've attached the new patch set (v56). I've squashed previous updates and addressed review comments on v55 in separate patches. Here are the update summary: 0004: fix compiler warning caught by ci test. 0005-0008: address review comments on radix tree codes. 0009: cleanup #define and #undef 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is undefined after including radixtree.h so we should not use it in test code. 0013-0015: address review comments on tidstore codes. 0017-0018: address review comments on vacuum integration codes. Looking at overall changes, there are still XXX and TODO comments in radixtree.h: --- * XXX There are 4 node kinds, and this should never be increased, * for several reasons: * 1. With 5 or more kinds, gcc tends to use a jump table for switch * statements. * 2. The 4 kinds can be represented with 2 bits, so we have the option * in the future to tag the node pointer with the kind, even on * platforms with 32-bit pointers. This might speed up node traversal * in trees with highly random node kinds. * 3. We can have multiple size classes per node kind. Can we just remove "XXX"? --- * WIP: notes about traditional radix tree trading off span vs height... Are you going to write it? --- #ifdef RT_SHMEM /* WIP: do we really need this? */ typedef dsa_pointer RT_HANDLE; #endif I think it's worth having it. --- * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit * inside a single bitmapword on most platforms, so it's a good starting * point. We can make it higher if we need to. */ #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4) Are you going to work something on this? --- /* WIP: We could go first to the higher node16 size class */ newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO); Does it mean to go to RT_CLASS_16_HI and then further go to RT_CLASS_16_LO upon further deletion? --- * TODO: The current locking mechanism is not optimized for high concurrency * with mixed read-write workloads. In the future it might be worthwhile * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in * the paper "The ART of Practical Synchronization" by the same authors as * the ART paper, 2016. 
I think it's not TODO for now, but a future improvement. We can remove it. --- /* TODO: consider 5 with subclass 1 or 2. */ #define RT_FANOUT_4 4 Is there something we need to do here? --- /* * Return index of the chunk and slot arrays for inserting into the node, * such that the chunk array remains ordered. * TODO: Improve performance for non-SIMD platforms. */ Are you going to work on this? --- /* Delete the element at 'idx' */ /* TODO: replace slow memmove's */ Are you going to work on this? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've attached the new patch set (v56). I've squashed previous updates > and addressed review comments on v55 in separate patches. Here are the > update summary: > > 0004: fix compiler warning caught by ci test. > 0005-0008: address review comments on radix tree codes. > 0009: cleanup #define and #undef > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is > undefined after including radixtree.h so we should not use it in test > code. Great, thanks! I have a few questions and comments on v56, then I'll address yours below with the attached v57, which is mostly cosmetic adjustments. v56-0003: (Looking closer at tests) +static const bool rt_test_stats = false; I'm thinking we should just remove everything that depends on this, and keep this module entirely about correctness. + for (int shift = 0; shift <= (64 - 8); shift += 8) + test_node_types(shift); I'm not sure what the test_node_types_* functions are testing that test_basic doesn't. They have a different, and confusing, way to stop at every size class and check the keys/values. It seems we can replace all that with two more calls (asc/desc) to test_basic, with the maximum level. It's pretty hard to see what test_pattern() is doing, or why it's useful. I wonder if instead the test could use something like the benchmark where random integers are masked off. That seems simpler. I can work on that, but I'd like to hear your side about test_pattern(). v56-0007: + * + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory, + * the caller doesn't need to take a lock. Maybe something like "Since dsa_get_total_size() does appropriate locking ..."? v56-0008 Thanks, I like how the tests look now. -NOTICE: testing node 4 with height 0 and ascending keys ... +NOTICE: testing node 1 with height 0 and ascending keys Now that the number is not intended to match a size class, "node X" seems out of place. Maybe we could have a separate array with strings? + 1, /* RT_CLASS_4 */ This should be more than one, so that the basic test still exercises paths that shift elements around. + 100, /* RT_CLASS_48 */ This node currently holds 64 for local memory. + 255 /* RT_CLASS_256 */ This is the only one where we know exactly how many it can take, so may as well keep it at 256. v56-0012: The test module for tidstore could use a few more comments. v56-0015: +typedef dsa_pointer TidStoreHandle; + -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) +TidStoreAttach(dsa_area *area, TidStoreHandle handle) { TidStore *ts; + dsa_pointer rt_dp = handle; My earlier opinion was that "handle" was a nicer variable name, but this brings back the typedef and also keeps the variable name I didn't like, but pushes it down into the function. I'm a bit confused, so I've kept these not-squashed for now. ----------------------------------------------------------------------------------- Now, for v57: > Looking at overall changes, there are still XXX and TODO comments in > radixtree.h: That's fine, as long as it's intentional as a message to readers. That said, we can get rid of some: > --- > * XXX There are 4 node kinds, and this should never be increased, > * for several reasons: > * 1. With 5 or more kinds, gcc tends to use a jump table for switch > * statements. > * 2. The 4 kinds can be represented with 2 bits, so we have the option > * in the future to tag the node pointer with the kind, even on > * platforms with 32-bit pointers. 
This might speed up node traversal > * in trees with highly random node kinds. > * 3. We can have multiple size classes per node kind. > > Can we just remove "XXX"? How about "NOTE"? > --- > * WIP: notes about traditional radix tree trading off span vs height... > > Are you going to write it? Yes, when I draft a rough commit message, (for next time). > --- > #ifdef RT_SHMEM > /* WIP: do we really need this? */ > typedef dsa_pointer RT_HANDLE; > #endif > > I think it's worth having it. Okay, removed WIP in v57-0004. > --- > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit > * inside a single bitmapword on most platforms, so it's a good starting > * point. We can make it higher if we need to. > */ > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4) > > Are you going to work something on this? Hard-coded 64 for readability, and changed this paragraph to explain the current rationale more clearly: "The paper uses at most 64 for this node kind, and one advantage for us is that "isset" is a single bitmapword on most platforms, rather than an array, allowing the compiler to get rid of loops." > --- > /* WIP: We could go first to the higher node16 size class */ > newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO); > > Does it mean to go to RT_CLASS_16_HI and then further go to > RT_CLASS_16_LO upon further deletion? Yes. It wouldn't be much work to make shrinking symmetrical with growing (a good thing), but it's not essential so I haven't done it yet. > --- > * TODO: The current locking mechanism is not optimized for high concurrency > * with mixed read-write workloads. In the future it might be worthwhile > * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in > * the paper "The ART of Practical Synchronization" by the same authors as > * the ART paper, 2016. > > I think it's not TODO for now, but a future improvement. We can remove it. It _is_ a TODO, regardless of when it happens. > --- > /* TODO: consider 5 with subclass 1 or 2. */ > #define RT_FANOUT_4 4 > > Is there something we need to do here? Changed to: "To save memory in trees with sparse keys, it would make sense to have two size classes for the smallest kind (perhaps a high class of 5 and a low class of 2), but it would be more effective to utilize lazy expansion and path compression." > --- > /* > * Return index of the chunk and slot arrays for inserting into the node, > * such that the chunk array remains ordered. > * TODO: Improve performance for non-SIMD platforms. > */ > > Are you going to work on this? A small step in v57-0010. I've found a way to kill two birds with one stone, by first checking for the case that the keys are inserted in order. This also helps the SIMD case because it must branch anyway to avoid bitscanning a zero bitfield. This moves the branch up and turns a mask into an assert, looking a bit nicer. I've removed the TODO, but maybe we should add it to the search_eq function. > --- > /* Delete the element at 'idx' */ > /* TODO: replace slow memmove's */ > > Are you going to work on this? Done in v57-0011. The rest: v57-0004 - 0008 should be self explanatory, but questions/pushback welcome. v57-0009 - I'm thinking leaves don't need to be memset at all. The value written should be entirely the caller's responsibility, it seems. v57-0013 - the bench module can be built locally again v57-0016 - minor comment edits in tid store My todo: - benchmark tid store / vacuum again, since we haven't since varlen types and removing unnecessary locks. 
I'm pretty sure there's an accidental memset call that crept in there, but I'm running out of steam today. - leftover comment etc work
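To make the "keys inserted in order" fast path described for v57-0010 concrete, a scalar sketch of the idea is below; the function and field names are placeholders rather than the actual radixtree.h code:

/*
 * Return the position at which 'chunk' should be inserted so that the
 * node's chunk array stays sorted.
 */
static int
node_get_insertpos(const uint8 *chunks, int count, uint8 chunk)
{
    /*
     * Common case: keys arrive in ascending order, so the new chunk simply
     * goes at the end.  A branch like this is needed anyway in the SIMD
     * version to avoid bit-scanning an all-zero comparison bitfield.
     */
    if (count == 0 || chunk > chunks[count - 1])
        return count;

    /* Otherwise, find the first existing chunk that sorts after it. */
    for (int i = 0; i < count; i++)
    {
        if (chunks[i] > chunk)
            return i;
    }
    return count;
}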
On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the new patch set (v56). I've squashed previous updates > > and addressed review comments on v55 in separate patches. Here are the > > update summary: > > > > 0004: fix compiler warning caught by ci test. > > 0005-0008: address review comments on radix tree codes. > > 0009: cleanup #define and #undef > > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is > > undefined after including radixtree.h so we should not use it in test > > code. > > Great, thanks! > > I have a few questions and comments on v56, then I'll address yours > below with the attached v57, which is mostly cosmetic adjustments. Thank you for the comments! I've squashed previous updates and your changes. > > v56-0003: > > (Looking closer at tests) > > +static const bool rt_test_stats = false; > > I'm thinking we should just remove everything that depends on this, > and keep this module entirely about correctness. Agreed. Removed in 0006 patch. > > + for (int shift = 0; shift <= (64 - 8); shift += 8) > + test_node_types(shift); > > I'm not sure what the test_node_types_* functions are testing that > test_basic doesn't. They have a different, and confusing, way to stop > at every size class and check the keys/values. It seems we can replace > all that with two more calls (asc/desc) to test_basic, with the > maximum level. Agreed, addressed in 0007 patch. > > It's pretty hard to see what test_pattern() is doing, or why it's > useful. I wonder if instead the test could use something like the > benchmark where random integers are masked off. That seems simpler. I > can work on that, but I'd like to hear your side about test_pattern(). Yeah, test_pattern() is originally created for the integerset so it doesn't necessarily fit the radixtree. I agree to use some tests from benchmarks. > > v56-0007: > > + * > + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory, > + * the caller doesn't need to take a lock. > > Maybe something like "Since dsa_get_total_size() does appropriate locking ..."? Agreed. Fixed in 0005 patch. > > v56-0008 > > Thanks, I like how the tests look now. > > -NOTICE: testing node 4 with height 0 and ascending keys > ... > +NOTICE: testing node 1 with height 0 and ascending keys > > Now that the number is not intended to match a size class, "node X" > seems out of place. Maybe we could have a separate array with strings? > > + 1, /* RT_CLASS_4 */ > > This should be more than one, so that the basic test still exercises > paths that shift elements around. > > + 100, /* RT_CLASS_48 */ > > This node currently holds 64 for local memory. > > + 255 /* RT_CLASS_256 */ > > This is the only one where we know exactly how many it can take, so > may as well keep it at 256. Fixed in 0008 patch. > > v56-0012: > > The test module for tidstore could use a few more comments. Addressed in 0012 patch. > > v56-0015: > > +typedef dsa_pointer TidStoreHandle; > + > > -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) > +TidStoreAttach(dsa_area *area, TidStoreHandle handle) > { > TidStore *ts; > + dsa_pointer rt_dp = handle; > > My earlier opinion was that "handle" was a nicer variable name, but > this brings back the typedef and also keeps the variable name I didn't > like, but pushes it down into the function. I'm a bit confused, so > I've kept these not-squashed for now. I misunderstood your comment. 
I've changed to use a variable name rt_handle and removed the TidStoreHandle type. 0013 patch. > > ----------------------------------------------------------------------------------- > > Now, for v57: > > > Looking at overall changes, there are still XXX and TODO comments in > > radixtree.h: > > That's fine, as long as it's intentional as a message to readers. That > said, we can get rid of some: > > > --- > > * XXX There are 4 node kinds, and this should never be increased, > > * for several reasons: > > * 1. With 5 or more kinds, gcc tends to use a jump table for switch > > * statements. > > * 2. The 4 kinds can be represented with 2 bits, so we have the option > > * in the future to tag the node pointer with the kind, even on > > * platforms with 32-bit pointers. This might speed up node traversal > > * in trees with highly random node kinds. > > * 3. We can have multiple size classes per node kind. > > > > Can we just remove "XXX"? > > How about "NOTE"? Agreed. > > > --- > > * WIP: notes about traditional radix tree trading off span vs height... > > > > Are you going to write it? > > Yes, when I draft a rough commit message, (for next time). Thanks! > > > --- > > #ifdef RT_SHMEM > > /* WIP: do we really need this? */ > > typedef dsa_pointer RT_HANDLE; > > #endif > > > > I think it's worth having it. > > Okay, removed WIP in v57-0004. > > > --- > > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit > > * inside a single bitmapword on most platforms, so it's a good starting > > * point. We can make it higher if we need to. > > */ > > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4) > > > > Are you going to work something on this? > > Hard-coded 64 for readability, and changed this paragraph to explain > the current rationale more clearly: > > "The paper uses at most 64 for this node kind, and one advantage for us > is that "isset" is a single bitmapword on most platforms, rather than > an array, allowing the compiler to get rid of loops." LGTM. > > > --- > > /* WIP: We could go first to the higher node16 size class */ > > newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO); > > > > Does it mean to go to RT_CLASS_16_HI and then further go to > > RT_CLASS_16_LO upon further deletion? > > Yes. It wouldn't be much work to make shrinking symmetrical with > growing (a good thing), but it's not essential so I haven't done it > yet. Okay, let's keep it as WIP. > > > --- > > * TODO: The current locking mechanism is not optimized for high concurrency > > * with mixed read-write workloads. In the future it might be worthwhile > > * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in > > * the paper "The ART of Practical Synchronization" by the same authors as > > * the ART paper, 2016. > > > > I think it's not TODO for now, but a future improvement. We can remove it. > > It _is_ a TODO, regardless of when it happens. Understood. > > > --- > > /* TODO: consider 5 with subclass 1 or 2. */ > > #define RT_FANOUT_4 4 > > > > Is there something we need to do here? > > Changed to: > > "To save memory in trees with sparse keys, it would make sense to have two > size classes for the smallest kind (perhaps a high class of 5 and a low class > of 2), but it would be more effective to utilize lazy expansion and > path compression." LGTM. But there is an extra '*' in the last line: + /* : + * of 2), but it would be more effective to utilize lazy expansion and + * path compression. + * */ Fixed in 0004 patch. 
> > > --- > > /* > > * Return index of the chunk and slot arrays for inserting into the node, > > * such that the chunk array remains ordered. > > * TODO: Improve performance for non-SIMD platforms. > > */ > > > > Are you going to work on this? > > A small step in v57-0010. I've found a way to kill two birds with one > stone, by first checking for the case that the keys are inserted in > order. This also helps the SIMD case because it must branch anyway to > avoid bitscanning a zero bitfield. This moves the branch up and turns > a mask into an assert, looking a bit nicer. I've removed the TODO, but > maybe we should add it to the search_eq function. Great! > > > --- > > /* Delete the element at 'idx' */ > > /* TODO: replace slow memmove's */ > > > > Are you going to work on this? > > Done in v57-0011. LGTM. > > The rest: > v57-0004 - 0008 should be self explanatory, but questions/pushback welcome. > v57-0009 - I'm thinking leaves don't need to be memset at all. The > value written should be entirely the caller's responsibility, it > seems. > v57-0013 - the bench module can be built locally again > v57-0016 - minor comment edits in tid store These fixes look good to me. > > My todo: > - benchmark tid store / vacuum again, since we haven't since varlen > types and removing unnecessary locks. I'm pretty sure there's an > accidental memset call that crept in there, but I'm running out of > steam today. > - leftover comment etc work Thanks. I'm also going to do some benchmarks and tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > My todo: > > - benchmark tid store / vacuum again, since we haven't since varlen > > types and removing unnecessary locks. I ran a vacuum benchmark similar to the one in [1] (unlogged tables for reproducibility), but smaller tables (100 million records), deleting only the last 20% of the table, and including a parallel vacuum test. Scripts attached. monotonically ordered int column index: master: system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s v-59: system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s uuid column index: master: system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s v-59: system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s int & uuid indexes in parallel: master: system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s v59: system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s Over all, I'm pleased with these results, although I'm confused why sometimes with the patch the first run reports running faster than the others. I'm curious what others get. Traversing a tree that lives in DSA has some overhead, as expected, but still comes out way ahead of master. There are still some micro-benchmarks we could do on tidstore, and it'd be good to find out worse-case memory use (1 dead tuple each on spread-out pages), but this is decent demonstration. > > I'm not sure what the test_node_types_* functions are testing that > > test_basic doesn't. They have a different, and confusing, way to stop > > at every size class and check the keys/values. It seems we can replace > > all that with two more calls (asc/desc) to test_basic, with the > > maximum level. v58-0008: + /* borrowed from RT_MAX_SHIFT */ + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) / BITS_PER_BYTE) * BITS_PER_BYTE; This is harder to read than "64 - 8", and doesn't really help maintainability either. Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise. + /* leaf nodes */ + test_basic(test_info, 0); + /* internal nodes */ + test_basic(test_info, 8); + + /* max-level nodes */ + test_basic(test_info, max_shift); This three-way terminology is not very informative. How about: + /* a tree with one level, i.e. a single node under the root node. */ ... + /* a tree with two levels */ ... 
+ /* a tree with the maximum number of levels */ +static void +test_basic(rt_node_class_test_elem *test_info, int shift) +{ + elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift); + + /* Test nodes while changing the key insertion order */ + do_test_basic(test_info->nkeys, shift, false); + do_test_basic(test_info->nkeys, shift, true); Adding a level of indirection makes this harder to read, and do we still know whether a test failed in asc or desc keys? > > My earlier opinion was that "handle" was a nicer variable name, but > > this brings back the typedef and also keeps the variable name I didn't > > like, but pushes it down into the function. I'm a bit confused, so > > I've kept these not-squashed for now. > > I misunderstood your comment. I've changed to use a variable name > rt_handle and removed the TidStoreHandle type. 0013 patch. (diff against an earlier version) - pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items); + pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items); Shall we use "handle" in vacuum_parallel.c as well? > > I'm pretty sure there's an > > accidental memset call that crept in there, but I'm running out of > > steam today. I have just a little bit of work to add for v59: v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero any bitmapwords. That can only happen if e.g. there is an offset > 128 and there are none between 64 and 128, so not a huge deal but I think it's a bit nicer in this patch. > > > * WIP: notes about traditional radix tree trading off span vs height... > > > > > > Are you going to write it? > > > > Yes, when I draft a rough commit message, (for next time). I haven't gotten to the commit message, but: v59-0004 - I did some rewriting of the top header comment to explain ART concepts for new readers, made small comment changes, and tidied up some indentation that pgindent won't touch v59-0005 - re-pgindent'ed [1] https://www.postgresql.org/message-id/CAFBsxsHUxmXYy0y4RrhMcNe-R11Bm099Xe-wUdb78pOu0%2BPT2Q%40mail.gmail.com
On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > My todo: > > > - benchmark tid store / vacuum again, since we haven't since varlen > > > types and removing unnecessary locks. > > I ran a vacuum benchmark similar to the one in [1] (unlogged tables > for reproducibility), but smaller tables (100 million records), > deleting only the last 20% of the table, and including a parallel > vacuum test. Scripts attached. > > monotonically ordered int column index: > > master: > system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s > system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s > system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s > > v-59: > system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s > system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s > system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s > > uuid column index: > > master: > system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s > system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s > system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s > > v-59: > system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s > system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s > system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s > > int & uuid indexes in parallel: > > master: > system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s > system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s > system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s > > v59: > system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s > system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s > system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s > > Over all, I'm pleased with these results, although I'm confused why > sometimes with the patch the first run reports running faster than the > others. I'm curious what others get. Traversing a tree that lives in > DSA has some overhead, as expected, but still comes out way ahead of > master. Thanks! That's a great improvement. 
I've also run the same scripts in my environment just in case and got similar results: monotonically ordered int column index: master: system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s v-59: system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s uuid column index: master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s v-59: system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s int & uuid indexes in parallel: master: system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s v-59: system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s > > There are still some micro-benchmarks we could do on tidstore, and > it'd be good to find out worse-case memory use (1 dead tuple each on > spread-out pages), but this is decent demonstration. I've tested a simple case where vacuum removes 33k dead tuples spread about every 10 pages. master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > > > I'm not sure what the test_node_types_* functions are testing that > > > test_basic doesn't. They have a different, and confusing, way to stop > > > at every size class and check the keys/values. It seems we can replace > > > all that with two more calls (asc/desc) to test_basic, with the > > > maximum level. > > v58-0008: > > + /* borrowed from RT_MAX_SHIFT */ > + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) / > BITS_PER_BYTE) * BITS_PER_BYTE; > > This is harder to read than "64 - 8", and doesn't really help > maintainability either. > Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise. > > + /* leaf nodes */ > + test_basic(test_info, 0); > > + /* internal nodes */ > + test_basic(test_info, 8); > + > + /* max-level nodes */ > + test_basic(test_info, max_shift); > > This three-way terminology is not very informative. How about: > > + /* a tree with one level, i.e. a single node under the root node. */ > ... > + /* a tree with two levels */ > ... > + /* a tree with the maximum number of levels */ Agreed. > > +static void > +test_basic(rt_node_class_test_elem *test_info, int shift) > +{ > + elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift); > + > + /* Test nodes while changing the key insertion order */ > + do_test_basic(test_info->nkeys, shift, false); > + do_test_basic(test_info->nkeys, shift, true); > > Adding a level of indirection makes this harder to read, and do we > still know whether a test failed in asc or desc keys? Agreed, it seems to be better to keep the previous logging style. 
> > > > My earlier opinion was that "handle" was a nicer variable name, but > > > this brings back the typedef and also keeps the variable name I didn't > > > like, but pushes it down into the function. I'm a bit confused, so > > > I've kept these not-squashed for now. > > > > I misunderstood your comment. I've changed to use a variable name > > rt_handle and removed the TidStoreHandle type. 0013 patch. > > (diff against an earlier version) > - pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items); > + pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items); > > Shall we use "handle" in vacuum_parallel.c as well? Agreed. > > > > I'm pretty sure there's an > > > accidental memset call that crept in there, but I'm running out of > > > steam today. > > I have just a little bit of work to add for v59: > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > any bitmapwords. That can only happen if e.g. there is an offset > 128 > and there are none between 64 and 128, so not a huge deal but I think > it's a bit nicer in this patch. LGTM. > > > > > * WIP: notes about traditional radix tree trading off span vs height... > > > > > > > > Are you going to write it? > > > > > > Yes, when I draft a rough commit message, (for next time). > > I haven't gotten to the commit message, but: I've drafted the commit message. > > v59-0004 - I did some rewriting of the top header comment to explain > ART concepts for new readers, made small comment changes, and tidied > up some indentation that pgindent won't touch > v59-0005 - re-pgindent'ed LGTM, squashed all changes. I've attached these updates from v59 in separate patches. I've run regression tests with valgrind and run the coverity scan, and I don't see critical issues. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > I've also run the same scripts in my environment just in case and got > similar results: Thanks for testing, looks good as well. > > There are still some micro-benchmarks we could do on tidstore, and > > it'd be good to find out worse-case memory use (1 dead tuple each on > > spread-out pages), but this is decent demonstration. > > I've tested a simple case where vacuum removes 33k dead tuples spread > about every 10 pages. > > master: > 198,000 bytes (=33000 * 6) > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > v-59: > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s The memory usage for the sparse case may be a concern, although it's not bad -- a multiple of something small is probably not huge in practice. See below for an option we have for this. > > > > I'm pretty sure there's an > > > > accidental memset call that crept in there, but I'm running out of > > > > steam today. > > > > I have just a little bit of work to add for v59: > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > and there are none between 64 and 128, so not a huge deal but I think > > it's a bit nicer in this patch. > > LGTM. Okay, I've squashed this. > I've drafted the commit message. Thanks, this is a good start. > I've run regression tests with valgrind and run the coverity scan, and > I don't see critical issues. Great! Now, I think we're in pretty good shape. There are a couple of things that might be objectionable, so I want to try to improve them in the little time we have: 1. Memory use for the sparse case. I shared an idea a few months ago of how runtime-embeddable values (true combined pointer-value slots) could work for tids. I don't think this is a must-have, but it's not a lot of code, and I have this working: v61-0006: Preparatory refactoring -- I think we should do this anyway, since the intent seems more clear to me. v61-0007: Runtime-embeddable tids -- Optional for v17, but should reduce memory regressions, so should be considered. Up to 3 tids can be stored in the last level child pointer. It's not polished, but I'll only proceed with that if we think we need this. "flags" iis called that because it could hold tidbitmap.c booleans (recheck, lossy) in the future, in addition to reserving space for the pointer tag. Note: I hacked the tests to only have 2 offsets per block to demo, but of course both paths should be tested. 2. Management of memory contexts. It's pretty verbose and messy. I think the abstraction could be better: A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we can't destroy or reset it. That means we have to do a lot of manual work. B: Passing "max_bytes" to the radix tree was my idea, I believe, but it seems the wrong responsibility. Not all uses will have a work_mem-type limit, I'm guessing. We only use it for limiting the max block size, and aset's default 8MB is already plenty small for vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so smaller, and there it makes sense to limit the max blocksize this way. C: The context for values has complex #ifdefs based on the value length/varlen, but it's both too much and not enough. 
If we get a bump context, how would we shoehorn that in for values for vacuum but not for tidbitmap? Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to TidStoreCreate(), and then to RT_CREATE. That context will contain the values (for local mem), and the node slabs will be children of the value context. That way, measuring memory usage and free-ing can just call with this parent context, and let recursion handle the rest. Perhaps the passed context can also hold the radix-tree struct, but I'm not sure since I haven't tried it. What do you think? With this resolved, I think the radix tree is pretty close to committable. The tid store will likely need some polish yet, but no major issues I know of. (And, finally, a small thing I that I wanted to share just so I don't forget, but maybe not worth the attention: In Andres's prototype, there is a comment wondering if an update can skip checking if it first need to create a root node. This is pretty easy, and done in v61-0008.)
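To illustrate what "runtime-embeddable tids" means in v61-0007 above, a conceptual sketch follows; the layout, bit widths, and field names here are assumptions for illustration only, not the actual encoding in the patch:

/*
 * The 64-bit child slot of a last-level node either holds a pointer to a
 * full per-block value (tag bit clear) or, when tagged, embeds up to three
 * offset numbers directly, so sparse blocks need no leaf allocation at all.
 */
typedef union BlocktableSlot
{
    uint64      pointer;        /* full value stored elsewhere */
    struct
    {
        uint16      flags;      /* pointer tag, plus room for future booleans
                                 * such as tidbitmap's recheck/lossy */
        uint16      offsets[3]; /* up to 3 embedded offset numbers */
    }           embedded;
} BlocktableSlot;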
v61 had a brown-paper-bag bug in the embedded tids patch that didn't show up in the tidstore test but caused vacuum to fail; this is fixed in v62.
On Thu, Feb 15, 2024 at 8:26 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > I've also run the same scripts in my environment just in case and got > > similar results: > > Thanks for testing, looks good as well. > > > > There are still some micro-benchmarks we could do on tidstore, and > > > it'd be good to find out worse-case memory use (1 dead tuple each on > > > spread-out pages), but this is decent demonstration. > > > > I've tested a simple case where vacuum removes 33k dead tuples spread > > about every 10 pages. > > > > master: > > 198,000 bytes (=33000 * 6) > > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > > > v-59: > > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > The memory usage for the sparse case may be a concern, although it's > not bad -- a multiple of something small is probably not huge in > practice. See below for an option we have for this. > > > > > > I'm pretty sure there's an > > > > > accidental memset call that crept in there, but I'm running out of > > > > > steam today. > > > > > > I have just a little bit of work to add for v59: > > > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > > and there are none between 64 and 128, so not a huge deal but I think > > > it's a bit nicer in this patch. > > > > LGTM. > > Okay, I've squashed this. > > > I've drafted the commit message. > > Thanks, this is a good start. > > > I've run regression tests with valgrind and run the coverity scan, and > > I don't see critical issues. > > Great! > > Now, I think we're in pretty good shape. There are a couple of things > that might be objectionable, so I want to try to improve them in the > little time we have: > > 1. Memory use for the sparse case. I shared an idea a few months ago > of how runtime-embeddable values (true combined pointer-value slots) > could work for tids. I don't think this is a must-have, but it's not a > lot of code, and I have this working: > > v61-0006: Preparatory refactoring -- I think we should do this anyway, > since the intent seems more clear to me. Looks good refactoring to me. > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > reduce memory regressions, so should be considered. Up to 3 tids can > be stored in the last level child pointer. It's not polished, but I'll > only proceed with that if we think we need this. "flags" iis called > that because it could hold tidbitmap.c booleans (recheck, lossy) in > the future, in addition to reserving space for the pointer tag. Note: > I hacked the tests to only have 2 offsets per block to demo, but of > course both paths should be tested. Interesting. 
I've run the same benchmark tests we did[1][2] (the median of 3 runs): monotonically ordered int column index: master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s uuid column index: master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s int & uuid indexes in parallel: master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s sparse case: master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s v-62: 729,088 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s I'm happy to see a huge improvement. While it's really fascinating to me, I'm concerned about the time left until the feature freeze. We need to polish both tidstore and vacuum integration patches in 5 weeks. Personally I'd like to have it as a separate patch for now, and focus on completing the main three patches since we might face some issues after pushing these patches. I think with 0007 patch it's a big win but it's still a win even without 0007 patch. > > 2. Management of memory contexts. It's pretty verbose and messy. I > think the abstraction could be better: > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > can't destroy or reset it. That means we have to do a lot of manual > work. > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > it seems the wrong responsibility. Not all uses will have a > work_mem-type limit, I'm guessing. We only use it for limiting the max > block size, and aset's default 8MB is already plenty small for > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > smaller, and there it makes sense to limit the max blocksize this way. > C: The context for values has complex #ifdefs based on the value > length/varlen, but it's both too much and not enough. If we get a bump > context, how would we shoehorn that in for values for vacuum but not > for tidbitmap? > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > TidStoreCreate(), and then to RT_CREATE. That context will contain the > values (for local mem), and the node slabs will be children of the > value context. That way, measuring memory usage and free-ing can just > call with this parent context, and let recursion handle the rest. > Perhaps the passed context can also hold the radix-tree struct, but > I'm not sure since I haven't tried it. What do you think? If I understand your idea correctly, RT_CREATE() creates the context for values as a child of the passed context and the node slabs as children of the value context. That way, measuring memory usage can just call with the value context. It sounds like a good idea. But it was not clear to me how to address point B and C. Another variant of this idea would be that RT_CREATE() creates the parent context of the value context to store radix-tree struct. 
That is, the hierarchy would be like: A MemoryContext (passed by vacuum through tidstore) - radix tree memory context (store radx-tree struct, control struct, and iterator) - value context (aset, slab, or bump) - node slab contexts Freeing can just call with the radix tree memory context. And perhaps it works even if tidstore passes CurrentMemoryContex to RT_CREATE()? > > With this resolved, I think the radix tree is pretty close to > committable. The tid store will likely need some polish yet, but no > major issues I know of. Agreed. > > (And, finally, a small thing I that I wanted to share just so I don't > forget, but maybe not worth the attention: In Andres's prototype, > there is a comment wondering if an update can skip checking if it > first need to create a root node. This is pretty easy, and done in > v61-0008.) LGTM, thanks! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" iis called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): > > monotonically ordered int column index: > > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s Hmm, that's strange -- this test is intended to delete all records from the last 20% of the blocks, so I wouldn't expect any improvement here, only in the sparse case. Maybe something is wrong. All the more reason to put it off... > I'm happy to see a huge improvement. While it's really fascinating to > me, I'm concerned about the time left until the feature freeze. We > need to polish both tidstore and vacuum integration patches in 5 > weeks. Personally I'd like to have it as a separate patch for now, and > focus on completing the main three patches since we might face some > issues after pushing these patches. I think with 0007 patch it's a big > win but it's still a win even without 0007 patch. Agreed to not consider it for initial commit. I'll hold on to it for some future time. > > 2. Management of memory contexts. It's pretty verbose and messy. I > > think the abstraction could be better: > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > > can't destroy or reset it. That means we have to do a lot of manual > > work. > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > > it seems the wrong responsibility. Not all uses will have a > > work_mem-type limit, I'm guessing. We only use it for limiting the max > > block size, and aset's default 8MB is already plenty small for > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > > smaller, and there it makes sense to limit the max blocksize this way. > > C: The context for values has complex #ifdefs based on the value > > length/varlen, but it's both too much and not enough. If we get a bump > > context, how would we shoehorn that in for values for vacuum but not > > for tidbitmap? > > > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > > TidStoreCreate(), and then to RT_CREATE. That context will contain the > > values (for local mem), and the node slabs will be children of the > > value context. That way, measuring memory usage and free-ing can just > > call with this parent context, and let recursion handle the rest. > > Perhaps the passed context can also hold the radix-tree struct, but > > I'm not sure since I haven't tried it. What do you think? > > If I understand your idea correctly, RT_CREATE() creates the context > for values as a child of the passed context and the node slabs as > children of the value context. That way, measuring memory usage can > just call with the value context. It sounds like a good idea. But it > was not clear to me how to address point B and C. 
For B & C, vacuum would create a context to pass to TidStoreCreate, and it wouldn't need to bother changing max block size. RT_CREATE would use that directly for leaves (if any), and would only create child slab contexts under it. It would not need to know about max_bytes. Modifying your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and leaves, if any) (aset (or future bump?))
  - node slab contexts

This might only be workable with aset, if we need to individually free the structs. (I haven't studied this, it was a recent idea.) It's simpler, because with small fixed length values, we don't need to detect that and avoid creating a leaf context. All leaves would live in the same context as the structs.

> Another variant of this idea would be that RT_CREATE() creates the
> parent context of the value context to store radix-tree struct. That
> is, the hierarchy would be like:
>
> A MemoryContext (passed by vacuum through tidstore)
> - radix tree memory context (stores radix-tree struct, control struct, and iterator)
> - value context (aset, slab, or bump)
> - node slab contexts

The template handling the value context here is complex, and is what I meant by 'C' above. Most fixed length allocations in all of the backend are aset, so it seems fine to use it always.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes CurrentMemoryContext to RT_CREATE()?

Seems like it would, but would keep some complexity, as I mentioned.
On Fri, Feb 16, 2024 at 12:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > > reduce memory regressions, so should be considered. Up to 3 tids can > > > be stored in the last level child pointer. It's not polished, but I'll > > > only proceed with that if we think we need this. "flags" iis called > > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > > the future, in addition to reserving space for the pointer tag. Note: > > > I hacked the tests to only have 2 offsets per block to demo, but of > > > course both paths should be tested. > > > > Interesting. I've run the same benchmark tests we did[1][2] (the > > median of 3 runs): > > > > monotonically ordered int column index: > > > > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s > > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s > > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s > > Hmm, that's strange -- this test is intended to delete all records > from the last 20% of the blocks, so I wouldn't expect any improvement > here, only in the sparse case. Maybe something is wrong. All the more > reason to put it off... Okay, let's dig it deeper later. > > > I'm happy to see a huge improvement. While it's really fascinating to > > me, I'm concerned about the time left until the feature freeze. We > > need to polish both tidstore and vacuum integration patches in 5 > > weeks. Personally I'd like to have it as a separate patch for now, and > > focus on completing the main three patches since we might face some > > issues after pushing these patches. I think with 0007 patch it's a big > > win but it's still a win even without 0007 patch. > > Agreed to not consider it for initial commit. I'll hold on to it for > some future time. > > > > 2. Management of memory contexts. It's pretty verbose and messy. I > > > think the abstraction could be better: > > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > > > can't destroy or reset it. That means we have to do a lot of manual > > > work. > > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > > > it seems the wrong responsibility. Not all uses will have a > > > work_mem-type limit, I'm guessing. We only use it for limiting the max > > > block size, and aset's default 8MB is already plenty small for > > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > > > smaller, and there it makes sense to limit the max blocksize this way. > > > C: The context for values has complex #ifdefs based on the value > > > length/varlen, but it's both too much and not enough. If we get a bump > > > context, how would we shoehorn that in for values for vacuum but not > > > for tidbitmap? > > > > > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > > > TidStoreCreate(), and then to RT_CREATE. That context will contain the > > > values (for local mem), and the node slabs will be children of the > > > value context. That way, measuring memory usage and free-ing can just > > > call with this parent context, and let recursion handle the rest. > > > Perhaps the passed context can also hold the radix-tree struct, but > > > I'm not sure since I haven't tried it. What do you think? 
> >
> > If I understand your idea correctly, RT_CREATE() creates the context
> > for values as a child of the passed context and the node slabs as
> > children of the value context. That way, measuring memory usage can
> > just call with the value context. It sounds like a good idea. But it
> > was not clear to me how to address point B and C.
>
> For B & C, vacuum would create a context to pass to TidStoreCreate,
> and it wouldn't need to bother changing max block size. RT_CREATE
> would use that directly for leaves (if any), and would only create
> child slab contexts under it. It would not need to know about
> max_bytes. Modifying your diagram a bit, something like:
>
> - caller-supplied radix tree memory context (the 3 structs -- and
> leaves, if any) (aset (or future bump?))
> - node slab contexts
>
> This might only be workable with aset, if we need to individually free
> the structs. (I haven't studied this, it was a recent idea.)
> It's simpler, because with small fixed length values, we don't need to
> detect that and avoid creating a leaf context. All leaves would live
> in the same context as the structs.

Thank you for the explanation.

I think that vacuum and tidbitmap (and future users) would end up having the same max block size calculation. And the layering seems slightly odd to me: the max-block-size-specified context is created at the vacuum (or tidbitmap) layer, a varlen-value radix tree is created at the tidstore layer, and the passed context is used for leaves (if varlen values are used) at the radix tree layer.

Another idea is to create a max-block-size-specified context at the tidstore layer. That is, vacuum and tidbitmap pass a work_mem limit and a flag indicating whether the tidstore can use the bump context, and tidstore creates an (aset or bump) memory context with the calculated max block size and passes it to the radix tree.

As for using the bump memory context, I feel that we need to store the iterator struct in an aset context at least, as it can be individually freed and re-created. Or it might not be necessary to allocate the iterator struct in the same context as the radix tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
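As a rough illustration of that alternative, the tidstore layer could own the block-size calculation and hand the resulting context down to the radix tree. The helper name, the 1/16 heuristic, and the simplified signature below are assumptions for the sketch, not the proposed code.

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch only: choose an allocation block size that stays well below the
 * caller's memory limit, then create the context the radix tree will use.
 */
static MemoryContext
sketch_tidstore_context(MemoryContext parent, size_t max_bytes)
{
    Size        maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

    /* keep blocks to roughly 1/16 of the limit (illustrative heuristic) */
    while (maxBlockSize > max_bytes / 16 &&
           maxBlockSize > ALLOCSET_DEFAULT_INITSIZE)
        maxBlockSize >>= 1;

    return AllocSetContextCreate(parent,
                                 "TID storage",
                                 ALLOCSET_DEFAULT_MINSIZE,
                                 ALLOCSET_DEFAULT_INITSIZE,
                                 maxBlockSize);
}

With something like this, neither vacuum nor tidbitmap repeats the calculation; they only supply their memory limit.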
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I think that vacuum and tidbitmap (and future users) would end up > having the same max block size calculation. And it seems slightly odd > layering to me that max-block-size-specified context is created on > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > tidstore layer, and the passed context is used for leaves (if > varlen-value is used) on radix tree layer. That sounds slightly more complicated than I was thinking of, but we could actually be talking about the same thing: I'm drawing a distinction between "used = must be detected / #ifdef'd" and "used = actually happens to call allocation". I meant that the passed context would _always_ be used for leaves, regardless of varlen or not. So with fixed-length values short enough to live in child pointer slots, that context would still be used for iteration etc. > Another idea is to create a > max-block-size-specified context on the tidstore layer. That is, > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > tidstore can use the bump context, and tidstore creates a (aset of > bump) memory context with the calculated max block size and passes it > to the radix tree. That might be a better abstraction since both uses have some memory limit. > As for using the bump memory context, I feel that we need to store > iterator struct in aset context at least as it can be individually > freed and re-created. Or it might not be necessary to allocate the > iterator struct in the same context as radix tree. Okay, that's one thing I was concerned about. Since we don't actually have a bump context yet, it seems simple to assume aset for non-nodes, and if we do get it, we can adjust slightly. Anyway, this seems like a good thing to try to clean up, but it's also not a show-stopper. On that note: I will be going on honeymoon shortly, and then to PGConf India, so I will have sporadic connectivity for the next 10 days and won't be doing any hacking during that time. Andres, did you want to take a look at the radix tree patch 0003? Aside from the above possible cleanup, most of it should be stable.
On Mon, Feb 19, 2024 at 7:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I think that vacuum and tidbitmap (and future users) would end up > > having the same max block size calculation. And it seems slightly odd > > layering to me that max-block-size-specified context is created on > > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > > tidstore layer, and the passed context is used for leaves (if > > varlen-value is used) on radix tree layer. > > That sounds slightly more complicated than I was thinking of, but we > could actually be talking about the same thing: I'm drawing a > distinction between "used = must be detected / #ifdef'd" and "used = > actually happens to call allocation". I meant that the passed context > would _always_ be used for leaves, regardless of varlen or not. So > with fixed-length values short enough to live in child pointer slots, > that context would still be used for iteration etc. > > > Another idea is to create a > > max-block-size-specified context on the tidstore layer. That is, > > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > > tidstore can use the bump context, and tidstore creates a (aset of > > bump) memory context with the calculated max block size and passes it > > to the radix tree. > > That might be a better abstraction since both uses have some memory limit. I've drafted this idea, and fixed a bug in tidstore.c. Here is the summary of updates from v62: - removed v62-0007 patch as we discussed - squashed v62-0006 and v62-0008 patches into 0003 patch - v63-0008 patch fixes a bug in tidstore. - v63-0009 patch is a draft idea of cleanup memory context handling. > > > As for using the bump memory context, I feel that we need to store > > iterator struct in aset context at least as it can be individually > > freed and re-created. Or it might not be necessary to allocate the > > iterator struct in the same context as radix tree. > > Okay, that's one thing I was concerned about. Since we don't actually > have a bump context yet, it seems simple to assume aset for non-nodes, > and if we do get it, we can adjust slightly. Anyway, this seems like a > good thing to try to clean up, but it's also not a show-stopper. > > On that note: I will be going on honeymoon shortly, and then to PGConf > India, so I will have sporadic connectivity for the next 10 days and > won't be doing any hacking during that time. Thank you for letting us know. Enjoy yourself! Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in a non-assert build? I wonder why my compiler didn't yell at me... Have you tried a tidstore-debug build without asserts?

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+                                        "tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+                                                ? tree->leaf_ctx
+                                                : tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when necessary (those names should look more like each other, btw)? I think that means anything not covered by this case:

+#ifndef RT_VARLEN_VALUE_SIZE
+    if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+        tree->leaf_ctx = SlabContextCreate(ctx,
+                                           RT_STR(RT_PREFIX) "radix_tree leaf contex",
+                                           RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+                                           sizeof(RT_VALUE_TYPE));
+#endif            /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I don't recall why we are? We've never had a fixed-length type test case on 64-bit, so it wasn't because it won through benchmarking. It seems a hold-over from the days of "multi-value leaves". Is it to avoid the possibility of space wastage with non-power-of-two size types?

For this stanza that remains unchanged:

    for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
    {
        MemoryContextDelete(tree->node_slabs[i]);
    }

    if (tree->leaf_ctx)
    {
        MemoryContextDelete(tree->leaf_ctx);
    }

...is there a reason we can't just delete tree->ctx, and let that recursively delete child contexts?

Secondly, I thought about my recent work to skip checking if we first need to create a root node, and that has a harmless (for vacuum at least) but slightly untidy behavior: When RT_SET is first called, and the key is bigger than 255, new nodes will go on top of the root node. These have chunk '0'. If all subsequent keys are big enough, the original root node will stay empty. If all keys are deleted, there will be a chain of empty nodes remaining. Again, I believe this is harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?
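One possible shape of the leaf_ctx change suggested above, written as a sketch against the template names that appear in the quoted code (this is not the actual patch): make leaf_ctx always valid at creation time so the allocation site needs no ternary.

/* Sketch of a fragment inside RT_CREATE (template macros as quoted above). */
#ifndef RT_VARLEN_VALUE_SIZE
    if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
        tree->leaf_ctx = SlabContextCreate(ctx,
                                           RT_STR(RT_PREFIX) "radix_tree leaf context",
                                           RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
                                           sizeof(RT_VALUE_TYPE));
    else
        tree->leaf_ctx = tree->context;    /* fall back to the tree's context */
#else
    tree->leaf_ctx = tree->context;
#endif            /* !RT_VARLEN_VALUE_SIZE */

    /* ... the allocation site can then be unconditional: */
    leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);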
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and I'm not convinced it's freeing all the memory. It's been many months since we discussed this last, but IIRC we cannot just tell DSA to free all its segments, right? Is there currently anything preventing us from destroying the whole DSA area at once?

+    /* The last level node has pointers to values */
+    if (shift == 0)
+    {
+        dsa_free(tree->dsa, ptr);
+        return;
+    }

IIUC, this doesn't actually free leaves, it only frees the last-level node. And, this function is unaware of whether children could be embedded values. I'm thinking we need to get rid of the above pre-check and instead have each node kind do something like this (e.g. node4):

    RT_PTR_ALLOC child = n4->children[i];

    if (shift > 0)
        RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
    else if (!RT_CHILDPTR_IS_VALUE(child))
        dsa_free(tree->dsa, child);

...or am I missing something?
I wrote: > Secondly, I thought about my recent work to skip checking if we first > need to create a root node, and that has a harmless (for vacuum at > least) but slightly untidy behavior: When RT_SET is first called, and > the key is bigger than 255, new nodes will go on top of the root node. > These have chunk '0'. If all subsequent keys are big enough, the > orginal root node will stay empty. If all keys are deleted, there will > be a chain of empty nodes remaining. Again, I believe this is > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > on this, but likely not today. This turns out to be a lot trickier than it looked, so it seems best to allow a trivial amount of waste, as long as it's documented somewhere. It also wouldn't be terrible to re-add those branches, since they're highly predictable. I just noticed there are a lot of unused function parameters (referring to parent slots) leftover from a few weeks ago. Those are removed in v64-0009. 0010 makes the obvious name change in those remaining to "parent_slot". 0011 is a simplification in two places regarding reserving slots. This should be a bit easier to read and possibly makes it easier on the compiler.
Attachment
On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > - v63-0008 patch fixes a bug in tidstore. > > - page->nwords = wordnum + 1; > - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1])); > + page->nwords = wordnum; > + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1])); > > Yikes, I'm guessing this failed in a non-assert builds? I wonder why > my compiler didn't yell at me... Have you tried a tidstore-debug build > without asserts? Yes. I didn't get any failures. > > > - v63-0009 patch is a draft idea of cleanup memory context handling. > > Thanks, looks pretty good! > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > + "tidstore storage", > > "tidstore storage" sounds a bit strange -- maybe look at some other > context names for ideas. Agreed. How about "tidstore's radix tree"? > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize); > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > + ? tree->leaf_ctx > + : tree->context, allocsize); > > Instead of branching here, can we copy "context" to "leaf_ctx" when > necessary (those names should look more like eachother, btw)? I think > that means anything not covered by this case: > > +#ifndef RT_VARLEN_VALUE_SIZE > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > + tree->leaf_ctx = SlabContextCreate(ctx, > + RT_STR(RT_PREFIX) "radix_tree leaf contex", > + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > + sizeof(RT_VALUE_TYPE)); > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > ...also, we should document why we're using slab here. On that, I > don't recall why we are? We've never had a fixed-length type test case > on 64-bit, so it wasn't because it won through benchmarking. It seems > a hold-over from the days of "multi-value leaves". Is it to avoid the > possibility of space wastage with non-power-of-two size types? Yes, it matches my understanding. > > For this stanza that remains unchanged: > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++) > { > MemoryContextDelete(tree->node_slabs[i]); > } > > if (tree->leaf_ctx) > { > MemoryContextDelete(tree->leaf_ctx); > } > > ...is there a reason we can't just delete tree->ctx, and let that > recursively delete child contexts? I thought that considering the RT_CREATE doesn't create its own memory context but just uses the passed context, it might be a bit unusable to delete the passed context in the radix tree code. For example, if a caller creates a radix tree (or tidstore) on a memory context and wants to recreate it again and again, he also needs to re-create the memory context together. It might be okay if we leave comments on RT_CREATE as a side effect, though. This is the same reason why we don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(), On Fri, Mar 1, 2024 at 1:15 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and > I'm not convinced it's freeing all the memory. It's been many months > since we discussed this last, but IIRC we cannot just tell DSA to free > all its segments, right? Right. > Is there currently anything preventing us > from destroying the whole DSA area at once? When it comes to tidstore and parallel vacuum, we initialize DSA and create a tidstore there at the beginning of the lazy vacuum, and recreate the tidstore again after the heap vacuum. 
So I don't want to destroy the whole DSA when destroying the tidstore. Otherwise, we will need to create a new DSA and pass its handle somehow. Probably the bitmap scan case is similar. Given that bitmap scan (re)creates tidbitmap in the same DSA multiple times, it's better to avoid freeing the whole DSA. > > + /* The last level node has pointers to values */ > + if (shift == 0) > + { > + dsa_free(tree->dsa, ptr); > + return; > + } > > IIUC, this doesn't actually free leaves, it only frees the last-level > node. And, this function is unaware of whether children could be > embedded values. I'm thinking we need to get rid of the above > pre-check and instead, each node kind to have something like (e.g. > node4): > > RT_PTR_ALLOC child = n4->children[i]; > > if (shift > 0) > RT_FREE_RECURSE(tree, child, shift - RT_SPAN); > else if (!RT_CHILDPTR_IS_VALUE(child)) > dsa_free(tree->dsa, child); > > ...or am I missing something? You're not missing anything. RT_FREE_RECURSE() has not been updated for a long time. If we still need to use RT_FREE_RECURSE(), it should be updated. > Thirdly, cosmetic: With the introduction of single-value leaves, it > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think? Agreed. On Fri, Mar 1, 2024 at 3:58 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I wrote: > > > Secondly, I thought about my recent work to skip checking if we first > > need to create a root node, and that has a harmless (for vacuum at > > least) but slightly untidy behavior: When RT_SET is first called, and > > the key is bigger than 255, new nodes will go on top of the root node. > > These have chunk '0'. If all subsequent keys are big enough, the > > orginal root node will stay empty. If all keys are deleted, there will > > be a chain of empty nodes remaining. Again, I believe this is > > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > on this, but likely not today. > > This turns out to be a lot trickier than it looked, so it seems best > to allow a trivial amount of waste, as long as it's documented > somewhere. It also wouldn't be terrible to re-add those branches, > since they're highly predictable. > > I just noticed there are a lot of unused function parameters > (referring to parent slots) leftover from a few weeks ago. Those are > removed in v64-0009. 0010 makes the obvious name change in those > remaining to "parent_slot". 0011 is a simplification in two places > regarding reserving slots. This should be a bit easier to read and > possibly makes it easier on the compiler. Thank you for the updates. I've briefly looked at these changes and they look good to me. I'm going to review them again in depth. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > > + "tidstore storage", > > > > "tidstore storage" sounds a bit strange -- maybe look at some other > > context names for ideas. > > Agreed. How about "tidstore's radix tree"? That might be okay. I'm now thinking "TID storage". On that note, one improvement needed when we polish tidstore.c is to make sure it's spelled "TID" in comments, like other files do already. > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize); > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > > + ? tree->leaf_ctx > > + : tree->context, allocsize); > > > > Instead of branching here, can we copy "context" to "leaf_ctx" when > > necessary (those names should look more like eachother, btw)? I think > > that means anything not covered by this case: > > > > +#ifndef RT_VARLEN_VALUE_SIZE > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > > + tree->leaf_ctx = SlabContextCreate(ctx, > > + RT_STR(RT_PREFIX) "radix_tree leaf contex", > > + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > > + sizeof(RT_VALUE_TYPE)); > > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > > > ...also, we should document why we're using slab here. On that, I > > don't recall why we are? We've never had a fixed-length type test case > > on 64-bit, so it wasn't because it won through benchmarking. It seems > > a hold-over from the days of "multi-value leaves". Is it to avoid the > > possibility of space wastage with non-power-of-two size types? > > Yes, it matches my understanding. There are two issues quoted here, so not sure if you mean both or only the last one... For the latter, I'm not sure it makes sense to have code and #ifdef's to force slab for large-enough fixed-length values just because we can. There may never be such a use-case anyway. I'm also not against it, either, but it seems like a premature optimization. > > For this stanza that remains unchanged: > > > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++) > > { > > MemoryContextDelete(tree->node_slabs[i]); > > } > > > > if (tree->leaf_ctx) > > { > > MemoryContextDelete(tree->leaf_ctx); > > } > > > > ...is there a reason we can't just delete tree->ctx, and let that > > recursively delete child contexts? > > I thought that considering the RT_CREATE doesn't create its own memory > context but just uses the passed context, it might be a bit unusable > to delete the passed context in the radix tree code. For example, if a > caller creates a radix tree (or tidstore) on a memory context and > wants to recreate it again and again, he also needs to re-create the > memory context together. It might be okay if we leave comments on > RT_CREATE as a side effect, though. This is the same reason why we > don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(), Right, I should have said "reset". Resetting a context will delete it's children as well, and seems like it should work to reset the tree context, and we don't have to know whether that context actually contains leaves at all. That should allow copying "tree context" to "leaf context" in the case where we have no special context for leaves.
On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 29, 2024 at 8:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > > > + "tidstore storage", > > > > > > "tidstore storage" sounds a bit strange -- maybe look at some other > > > context names for ideas. > > > > Agreed. How about "tidstore's radix tree"? > > That might be okay. I'm now thinking "TID storage". On that note, one > improvement needed when we polish tidstore.c is to make sure it's > spelled "TID" in comments, like other files do already. > > > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize); > > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > > > + ? tree->leaf_ctx > > > + : tree->context, allocsize); > > > > > > Instead of branching here, can we copy "context" to "leaf_ctx" when > > > necessary (those names should look more like eachother, btw)? I think > > > that means anything not covered by this case: > > > > > > +#ifndef RT_VARLEN_VALUE_SIZE > > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > > > + tree->leaf_ctx = SlabContextCreate(ctx, > > > + RT_STR(RT_PREFIX) "radix_tree leaf contex", > > > + RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > > > + sizeof(RT_VALUE_TYPE)); > > > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > > > > > ...also, we should document why we're using slab here. On that, I > > > don't recall why we are? We've never had a fixed-length type test case > > > on 64-bit, so it wasn't because it won through benchmarking. It seems > > > a hold-over from the days of "multi-value leaves". Is it to avoid the > > > possibility of space wastage with non-power-of-two size types? > > > > Yes, it matches my understanding. > > There are two issues quoted here, so not sure if you mean both or only > the last one... I meant only the last one. > > For the latter, I'm not sure it makes sense to have code and #ifdef's > to force slab for large-enough fixed-length values just because we > can. There may never be such a use-case anyway. I'm also not against > it, either, but it seems like a premature optimization. Reading the old threads, the fact that using a slab context for leaves originally came from Andres's prototype patch, was to avoid rounding up the bytes to a power of 2 number by aset.c. It makes sense to me to use a slab context for this case. To measure the effect of using a slab, I've updated bench_radix_tree so it uses a large fixed-length value. The struct I used is: typedef struct mytype { uint64 a; uint64 b; uint64 c; uint64 d; char e[100]; } mytype; The struct size is 136 bytes with padding, just above a power-of-2. The simple benchmark test showed using a slab context for leaves is more space efficient. The results are: slab: = #select * from bench_load_random_int(1000000); mem_allocated | load_ms ---------------+--------- 405643264 | 560 (1 row) aset: =# select * from bench_load_random_int(1000000); mem_allocated | load_ms ---------------+--------- 527777792 | 576 (1 row) > > > > For this stanza that remains unchanged: > > > > > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++) > > > { > > > MemoryContextDelete(tree->node_slabs[i]); > > > } > > > > > > if (tree->leaf_ctx) > > > { > > > MemoryContextDelete(tree->leaf_ctx); > > > } > > > > > > ...is there a reason we can't just delete tree->ctx, and let that > > > recursively delete child contexts? 
> > > > I thought that considering the RT_CREATE doesn't create its own memory > > context but just uses the passed context, it might be a bit unusable > > to delete the passed context in the radix tree code. For example, if a > > caller creates a radix tree (or tidstore) on a memory context and > > wants to recreate it again and again, he also needs to re-create the > > memory context together. It might be okay if we leave comments on > > RT_CREATE as a side effect, though. This is the same reason why we > > don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(), > > Right, I should have said "reset". Resetting a context will delete > it's children as well, and seems like it should work to reset the tree > context, and we don't have to know whether that context actually > contains leaves at all. That should allow copying "tree context" to > "leaf context" in the case where we have no special context for > leaves. Resetting the tree->context seems to work. But I think we should note for callers that the dsa_area passed to RT_CREATE should be created in a different context than the context passed to RT_CREATE because otherwise RT_FREE() will also free the dsa_area. For example, the following code in test_radixtree.c will no longer work: dsa = dsa_create(tranche_id); radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id); : rt_free(radixtree); dsa_detach(dsa); // dsa is already freed. So I think that a practical usage of the radix tree will be that the caller creates a memory context for a radix tree and passes it to RT_CREATE(). I've attached an update patch set: - 0008 updates RT_FREE_RECURSE(). - 0009 patch is an updated version of cleanup radix tree memory handling. - 0010 updates comments in tidstore.c such as replacing "Tid" with "TID". - 0011 rename TidStore to TIDSTORE all places. - 0012 update bench_radix_tree so it uses a (possibly large) struct instead of uint64. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
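The space difference reported in the benchmark above follows from how the two allocators size their chunks. Here is a small illustrative sketch (the context names and the helper are made up): aset rounds each request up to the next power of two, so a 136-byte leaf occupies a 256-byte chunk, while slab carves chunks of exactly sizeof(mytype).

#include "postgres.h"
#include "utils/memutils.h"

typedef struct mytype
{
    uint64      a;
    uint64      b;
    uint64      c;
    uint64      d;
    char        e[100];
} mytype;                       /* 136 bytes with padding */

/* Sketch only: the two leaf-context choices being compared above */
static MemoryContext
sketch_leaf_context(MemoryContext parent, bool use_slab)
{
    if (use_slab)
        return SlabContextCreate(parent, "benchmark leaf slab",
                                 SLAB_DEFAULT_BLOCK_SIZE,
                                 sizeof(mytype));   /* exact-size chunks */

    /* aset would round each 136-byte leaf request up to 256 bytes */
    return AllocSetContextCreate(parent, "benchmark leaf aset",
                                 ALLOCSET_DEFAULT_SIZES);
}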
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > Right, I should have said "reset". Resetting a context will delete
> > it's children as well, and seems like it should work to reset the tree
> > context, and we don't have to know whether that context actually
> > contains leaves at all. That should allow copying "tree context" to
> > "leaf context" in the case where we have no special context for
> > leaves.
>
> Resetting the tree->context seems to work. But I think we should note
> for callers that the dsa_area passed to RT_CREATE should be created in
> a different context than the context passed to RT_CREATE because
> otherwise RT_FREE() will also free the dsa_area. For example, the
> following code in test_radixtree.c will no longer work:
>
> dsa = dsa_create(tranche_id);
> radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> :
> rt_free(radixtree);
> dsa_detach(dsa); // dsa is already freed.
>
> So I think that a practical usage of the radix tree will be that the
> caller creates a memory context for a radix tree and passes it to
> RT_CREATE().

That sounds workable to me.

> I've attached an update patch set:
>
> - 0008 updates RT_FREE_RECURSE().

Thanks!

> - 0009 patch is an updated version of cleanup radix tree memory handling.

Looks pretty good, as does the rest. I'm going through again, squashing and making tiny adjustments to the template. The only thing not done is changing the test with many values to resemble the perf test more.

I wrote:

> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.

I put a little more work into this and got it working; it just needs a small amount of finicky coding. I'll share tomorrow.

I have a question about RT_FREE_RECURSE:

+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();

I'm not sure why these are here: The first seems overly paranoid, although harmless, but the second is probably a bad idea. Why should the user be able to interrupt the freeing of memory?

Also, I'm not quite happy that RT_ITER has a copy of a pointer to the tree, leading to coding like "iter->tree->ctl->root". I *think* it would be easier to read if the tree was a parameter to these iteration functions. That would require an API change, so the tests/tidstore would have some churn. I can do that, but before trying I wanted to see what you think -- is there some reason to keep the current way?
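Putting the agreed points above together, a usage sketch for the shared case looks roughly like this. It uses the rt_ names from the quoted test code; the wrapper function and the rt_radix_tree type name are assumptions of this sketch, not the test module itself. The key property is that the DSA area is created outside the context handed to rt_create, so freeing the tree cannot free the area out from under the caller.

#include "postgres.h"
#include "utils/dsa.h"
#include "utils/memutils.h"
/* assumes the rt_ radix tree template instantiation from the test module */

static void
sketch_shared_usage(int tranche_id)
{
    MemoryContext rt_ctx;
    dsa_area   *dsa;
    rt_radix_tree *radixtree;   /* templated struct name assumed */

    /* dedicated context for the tree, per the discussion above */
    rt_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                   "radix tree",
                                   ALLOCSET_SMALL_SIZES);

    dsa = dsa_create(tranche_id);       /* allocated outside rt_ctx */
    radixtree = rt_create(rt_ctx, dsa, tranche_id);

    /* ... build and query the tree ... */

    rt_free(radixtree);         /* releases what lives in rt_ctx */
    dsa_detach(dsa);            /* safe: the dsa_area was not in rt_ctx */

    /* assuming rt_free only resets rt_ctx, as discussed, it can go now */
    MemoryContextDelete(rt_ctx);
}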
On Mon, Mar 4, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 3, 2024 at 2:43 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Right, I should have said "reset". Resetting a context will delete > > > it's children as well, and seems like it should work to reset the tree > > > context, and we don't have to know whether that context actually > > > contains leaves at all. That should allow copying "tree context" to > > > "leaf context" in the case where we have no special context for > > > leaves. > > > > Resetting the tree->context seems to work. But I think we should note > > for callers that the dsa_area passed to RT_CREATE should be created in > > a different context than the context passed to RT_CREATE because > > otherwise RT_FREE() will also free the dsa_area. For example, the > > following code in test_radixtree.c will no longer work: > > > > dsa = dsa_create(tranche_id); > > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id); > > : > > rt_free(radixtree); > > dsa_detach(dsa); // dsa is already freed. > > > > So I think that a practical usage of the radix tree will be that the > > caller creates a memory context for a radix tree and passes it to > > RT_CREATE(). > > That sounds workable to me. > > > I've attached an update patch set: > > > > - 0008 updates RT_FREE_RECURSE(). > > Thanks! > > > - 0009 patch is an updated version of cleanup radix tree memory handling. > > Looks pretty good, as does the rest. I'm going through again, > squashing and making tiny adjustments to the template. The only thing > not done is changing the test with many values to resemble the perf > test more. > > I wrote: > > > Secondly, I thought about my recent work to skip checking if we first > > > need to create a root node, and that has a harmless (for vacuum at > > > least) but slightly untidy behavior: When RT_SET is first called, and > > > the key is bigger than 255, new nodes will go on top of the root node. > > > These have chunk '0'. If all subsequent keys are big enough, the > > > orginal root node will stay empty. If all keys are deleted, there will > > > be a chain of empty nodes remaining. Again, I believe this is > > > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > > on this, but likely not today. > > > > This turns out to be a lot trickier than it looked, so it seems best > > to allow a trivial amount of waste, as long as it's documented > > somewhere. It also wouldn't be terrible to re-add those branches, > > since they're highly predictable. > > I put a little more work into this, and got it working, just needs a > small amount of finicky coding. I'll share tomorrow. > > I have a question about RT_FREE_RECURSE: > > + check_stack_depth(); > + CHECK_FOR_INTERRUPTS(); > > I'm not sure why these are here: The first seems overly paranoid, > although harmless, but the second is probably a bad idea. Why should > the user be able to to interrupt the freeing of memory? Good catch. We should not check the interruption there. > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the > tree, leading to coding like "iter->tree->ctl->root". I *think* it > would be easier to read if the tree was a parameter to these iteration > functions. That would require an API change, so the tests/tidstore > would have some churn. 
I can do that, but before trying I wanted to > see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style. I'm concerned that if we pass both the tree and RT_ITER to iteration functions, the caller could mistakenly pass a different tree than the one that was specified to create the RT_ITER. The second reason is just to make it consistent with other data structures such as dynahash.c and dshash.c, but I now realize that in simplehash.h we pass both the hash table and the iterator.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 5, 2024 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 4, 2024 at 8:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Resetting the tree->context seems to work. But I think we should note > > > for callers that the dsa_area passed to RT_CREATE should be created in > > > a different context than the context passed to RT_CREATE because > > > otherwise RT_FREE() will also free the dsa_area. For example, the > > > following code in test_radixtree.c will no longer work: I've added a comment in v66-0004, which contains a number of other small corrections and edits. On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thirdly, cosmetic: With the introduction of single-value leaves, it > > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think? > > Agreed. Done in v66-0005. v66-0006 removes outdated tests for invalid root that somehow got left over. > > I wrote: > > > > Secondly, I thought about my recent work to skip checking if we first > > > > need to create a root node, and that has a harmless (for vacuum at > > > > least) but slightly untidy behavior: When RT_SET is first called, and > > > > the key is bigger than 255, new nodes will go on top of the root node. > > > > These have chunk '0'. If all subsequent keys are big enough, the > > > > orginal root node will stay empty. If all keys are deleted, there will > > > > be a chain of empty nodes remaining. Again, I believe this is > > > > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > > > on this, but likely not today. > > I put a little more work into this, and got it working, just needs a > > small amount of finicky coding. I'll share tomorrow. Done in v66-0007. I'm a bit disappointed in the extra messiness this adds, although it's not a lot. > > + check_stack_depth(); > > + CHECK_FOR_INTERRUPTS(); > > > > I'm not sure why these are here: The first seems overly paranoid, > > although harmless, but the second is probably a bad idea. Why should > > the user be able to to interrupt the freeing of memory? > > Good catch. We should not check the interruption there. Removed in v66-0008. > > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the > > tree, leading to coding like "iter->tree->ctl->root". I *think* it > > would be easier to read if the tree was a parameter to these iteration > > functions. That would require an API change, so the tests/tidstore > > would have some churn. I can do that, but before trying I wanted to > > see what you think -- is there some reason to keep the current way? > > I considered both usages, there are two reasons for the current style. > I'm concerned that if we pass both the tree and RT_ITER to iteration > functions, the caller could mistakenly pass a different tree than the > one that was specified to create the RT_ITER. And the second reason is > just to make it consistent with other data structures such as > dynahash.c and dshash.c, but I now realized that in simplehash.h we > pass both the hash table and the iterator. Okay, then I don't think it's worth messing with at this point. On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > It's pretty hard to see what test_pattern() is doing, or why it's > > useful. 
I wonder if instead the test could use something like the > > benchmark where random integers are masked off. That seems simpler. I > > can work on that, but I'd like to hear your side about test_pattern(). > > Yeah, test_pattern() is originally created for the integerset so it > doesn't necessarily fit the radixtree. I agree to use some tests from > benchmarks. Done in v66-0009. I'd be curious to hear any feedback. I like the aspect that the random numbers come from a different seed every time the test runs. v66-0010/0011 run pgindent, the latter with one typedef added for the test module. 0012 - 0017 are copied from v65, and I haven't done any work on tidstore or vacuum, except for squashing most v65 follow-up patches. I'd like to push 0001 and 0002 shortly, and then do another sweep over 0003, with remaining feedback, and get that in so we get some buildfarm testing before the remaining polishing work on tidstore/vacuum.
Attachment
On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > It's pretty hard to see what test_pattern() is doing, or why it's > > > useful. I wonder if instead the test could use something like the > > > benchmark where random integers are masked off. That seems simpler. I > > > can work on that, but I'd like to hear your side about test_pattern(). > > > > Yeah, test_pattern() is originally created for the integerset so it > > doesn't necessarily fit the radixtree. I agree to use some tests from > > benchmarks. > > Done in v66-0009. I'd be curious to hear any feedback. I like the > aspect that the random numbers come from a different seed every time > the test runs. The new tests look good. Here are some comments: --- + expected = keys[i]; + iterval = rt_iterate_next(iter, &iterkey); - ndeleted++; + EXPECT_TRUE(iterval != NULL); + EXPECT_EQ_U64(iterkey, expected); + EXPECT_EQ_U64(*iterval, expected); Can we verify that the iteration returns keys in ascending order? --- + /* reset random number generator for deletion */ + pg_prng_seed(&state, seed); Why is resetting the seed required here? --- The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end. --- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, "test_radix_tree", ALLOCSET_DEFAULT_SIZES); We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I think it's better to use either one for consistency. > I'd like to push 0001 and 0002 shortly, and then do another sweep over > 0003, with remaining feedback, and get that in so we get some > buildfarm testing before the remaining polishing work on > tidstore/vacuum. Sounds a reasonable plan. 0001 and 0002 look good to me. I'm going to polish tidstore and vacuum patches and update commit messages. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Done in v66-0009. I'd be curious to hear any feedback. I like the > > aspect that the random numbers come from a different seed every time > > the test runs. > > The new tests look good. Here are some comments: > > --- > + expected = keys[i]; > + iterval = rt_iterate_next(iter, &iterkey); > > - ndeleted++; > + EXPECT_TRUE(iterval != NULL); > + EXPECT_EQ_U64(iterkey, expected); > + EXPECT_EQ_U64(*iterval, expected); > > Can we verify that the iteration returns keys in ascending order? We get the "expected" value from the keys we saved in the now-sorted array, so we do already. Unless I misunderstand you. > --- > + /* reset random number generator for deletion */ > + pg_prng_seed(&state, seed); > > Why is resetting the seed required here? Good catch - My intention was to delete in the same random order we inserted with. We still have the keys in the array, but they're sorted by now. I forgot to go the extra step and use the prng when generating the keys for deletion -- will fix. > --- > The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end. Will fix. > --- > radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, > "test_radix_tree", > ALLOCSET_DEFAULT_SIZES); > > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I > think it's better to use either one for consistency. Will change to "small", since 32-bit platforms will use slab for leaves. I'll look at the memory usage and estimate what 32-bit platforms will use, and maybe adjust the number of keys. A few megabytes is fine, but not many megabytes. > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > Sounds a reasonable plan. 0001 and 0002 look good to me. I'm going to > polish tidstore and vacuum patches and update commit messages. Sounds good.
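A minimal sketch of the reseeding intent described above (not the test module code): seeding the generator twice with the same seed replays the same key sequence, so deletion can follow the original random insertion order even though the saved key array has since been sorted.

#include "postgres.h"
#include "common/pg_prng.h"

static void
sketch_insert_then_delete_random(uint64 seed, int nkeys)
{
    pg_prng_state state;

    pg_prng_seed(&state, seed);
    for (int i = 0; i < nkeys; i++)
    {
        uint64      key = pg_prng_uint64(&state);

        (void) key;             /* e.g. rt_set(radixtree, key, &value); */
    }

    /* replay the identical sequence for deletion */
    pg_prng_seed(&state, seed);
    for (int i = 0; i < nkeys; i++)
    {
        uint64      key = pg_prng_uint64(&state);

        (void) key;             /* e.g. rt_delete(radixtree, key); */
    }
}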
On Wed, Mar 6, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 5, 2024 at 6:41 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Done in v66-0009. I'd be curious to hear any feedback. I like the > > > aspect that the random numbers come from a different seed every time > > > the test runs. > > > > The new tests look good. Here are some comments: > > > > --- > > + expected = keys[i]; > > + iterval = rt_iterate_next(iter, &iterkey); > > > > - ndeleted++; > > + EXPECT_TRUE(iterval != NULL); > > + EXPECT_EQ_U64(iterkey, expected); > > + EXPECT_EQ_U64(*iterval, expected); > > > > Can we verify that the iteration returns keys in ascending order? > > We get the "expected" value from the keys we saved in the now-sorted > array, so we do already. Unless I misunderstand you. Ah, you're right. Please ignore this comment. > > > --- > > radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, > > "test_radix_tree", > > ALLOCSET_DEFAULT_SIZES); > > > > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I > > think it's better to use either one for consistency. > > Will change to "small", since 32-bit platforms will use slab for leaves. Agreed. > > I'll look at the memory usage and estimate what 32-bit platforms will > use, and maybe adjust the number of keys. A few megabytes is fine, but > not many megabytes. Thanks, sounds good. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On 2024-03-05 16:41:30 +0700, John Naylor wrote: > I'd like to push 0001 and 0002 shortly, and then do another sweep over > 0003, with remaining feedback, and get that in so we get some > buildfarm testing before the remaining polishing work on > tidstore/vacuum. A few ARM buildfarm animals are complaining: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 Greetings, Andres Freund
On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-03-05 16:41:30 +0700, John Naylor wrote:
> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> A few ARM buildfarm animals are complaining:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18
>

The error message we got is:

../../src/include/port/simd.h:326:71: error: incompatible type for argument 1 of \342\200\230vshrq_n_s8\342\200\231
  uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
                                                                       ^

Since 'v' is uint8x16_t, I think we should have used vshrq_n_u8() instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote: > > A few ARM buildfarm animals are complaining: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 > > > > The error message we got is: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of \342\200\230vshrq_n_s8\342\200\231 > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); > ^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. That sounds plausible, and I'll look further. (Hmm, I thought we had run this code on Arm already...)
Hi, On March 6, 2024 9:06:50 AM GMT+01:00, John Naylor <johncnaylorls@gmail.com> wrote: >On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote: > >> > A few ARM buildfarm animals are complaining: >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 >> > >> >> The error message we got is: >> >> ../../src/include/port/simd.h:326:71: error: incompatible type for >> argument 1 of \342\200\230vshrq_n_s8\342\200\231 >> uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); >> ^ >> >> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > >That sounds plausible, and I'll look further. > >(Hmm, I thought we had run this code on Arm already...) Perhaps we should switch one of the CI jobs to ARM... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Mar 6, 2024 at 3:06 PM John Naylor <johncnaylorls@gmail.com> wrote: > > (Hmm, I thought we had run this code on Arm already...) CI MacOS uses Clang on aarch64, which has been working fine. The failing animals are on gcc 7.3...
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of \342\200\230vshrq_n_s8\342\200\231 > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); > ^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. I've looked around and it seems clang is more lax on conversions. Since it works fine for clang, I think we just need a cast here for gcc. I've attached a blind attempt at a fix -- I'll apply shortly unless someone happens to test and find it doesn't work.
Attachment
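For reference, the kind of cast being discussed would look something like the sketch below; this is an illustration of the idea, not necessarily the attached patch. vshrq_n_s8() is declared to take int8x16_t, so casting the unsigned vector on the way in (and the signed result back) satisfies stricter compilers while keeping the same arithmetic shift that broadcasts each byte's sign bit.

#include <arm_neon.h>

/* Sketch: turn each byte's high bit into a full 0x00/0xFF byte mask. */
static inline uint8x16_t
sketch_high_bit_mask(uint8x16_t v)
{
    /* arithmetic shift right by 7 on the signed view of the same bytes */
    return (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7);
}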
On Wed, Mar 6, 2024 at 5:33 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > ../../src/include/port/simd.h:326:71: error: incompatible type for > > argument 1 of \342\200\230vshrq_n_s8\342\200\231 > > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); > > ^ > > > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > > I've looked around and it seems clang is more lax on conversions. > Since it works fine for clang, I think we just need a cast here for > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > unless someone happens to test and find it doesn't work. I've reproduced the same error on my raspberry pi, and confirmed the patch fixes the error. My previous idea was wrong. With my proposal, the regression test for radix tree failed on my raspberry pi. On the other hand, with your patch the tests passed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 3:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 5:33 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I've looked around and it seems clang is more lax on conversions. > > Since it works fine for clang, I think we just need a cast here for > > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > > unless someone happens to test and find it doesn't work. > > I've reproduced the same error on my raspberry pi, and confirmed the > patch fixes the error. > > My previous idea was wrong. With my proposal, the regression test for > radix tree failed on my raspberry pi. On the other hand, with your > patch the tests passed. Pushed, and at least parula's green now, thanks for testing! And thanks, Andres, for the ping!
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > Sounds a reasonable plan. 0001 and 0002 look good to me. I'm going to > polish tidstore and vacuum patches and update commit messages. I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm attaching v67 which fixes that and has some small cosmetic adjustments to the template. One functional change for debugging build is that RT_STATS now prints out the number of leaves. I'll squash and push 0001 tomorrow morning unless there are further comments.
Attachment
Actually, I forgot -- I had one more question: Masahiko, is there a reason for this extra local variable, which uses the base type, rather than the typedef'd parameter? +RT_SCOPE RT_RADIX_TREE * +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) +{ + RT_RADIX_TREE *tree; + dsa_pointer control; + + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); + + /* Find the control object in shared memory */ + control = handle;
On Wed, Mar 6, 2024 at 8:25 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Actually, I forgot -- I had one more question: Masahiko, is there a > reason for this extra local variable, which uses the base type, rather > than the typedef'd parameter? > > +RT_SCOPE RT_RADIX_TREE * > +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) > +{ > + RT_RADIX_TREE *tree; > + dsa_pointer control; > + > + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); > + > + /* Find the control object in shared memory */ > + control = handle; I think it's mostly because of readability; it makes it clear that the handle should be castable to dsa_pointer and that it's a control object. I borrowed it from dshash_attach(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 8:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > > 0003, with remaining feedback, and get that in so we get some > > > buildfarm testing before the remaining polishing work on > > > tidstore/vacuum. > > > > Sounds a reasonable plan. 0001 and 0002 look good to me. I'm going to > > polish tidstore and vacuum patches and update commit messages. > > I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm > attaching v67 which fixes that and has some small cosmetic adjustments > to the template. Thank you for updating the patch. > One functional change for debugging build is that > RT_STATS now prints out the number of leaves. I'll squash and push > 0001 tomorrow morning unless there are further comments. The 0001 patch looks good to me. I have some minor comments: --- /dev/null +++ b/src/test/modules/test_radixtree/Makefile @@ -0,0 +1,23 @@ +# src/test/modules/test_radixtree/Makefile + +MODULE_big = test_radixtree +OBJS = \ + $(WIN32RES) \ + test_radixtree.o +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" + "src/backend/lib/radixtree.c" should be updated to "src/include/lib/radixtree.h". --- --- /dev/null +++ b/src/test/modules/test_radixtree/README @@ -0,0 +1,7 @@ +test_integerset contains unit tests for testing the integer set implementation +in src/backend/lib/integerset.c. + +The tests verify the correctness of the implementation, but they can also be +used as a micro-benchmark. If you set the 'intset_test_stats' flag in +test_integerset.c, the tests will print extra information about execution time +and memory usage. This file is not updated for test_radixtree. I think we can remove it as the test cases in test_radixtree are clear. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > + /* Find the control object in shared memory */ > > + control = handle; > > I think it's mostly because of readability; it makes clear that the > handle should be castable to dsa_pointer and it's a control object. I > borrowed it from dshash_attach(). I find that a bit strange, but I went ahead and kept it. On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The 0001 patch looks good to me. I have some minor comments: > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > + > > "src/backend/lib/radixtree.c" should be updated to > "src/include/lib/radixtree.h". Done. > --- /dev/null > +++ b/src/test/modules/test_radixtree/README > @@ -0,0 +1,7 @@ > +test_integerset contains unit tests for testing the integer set implementation > +in src/backend/lib/integerset.c. > + > +The tests verify the correctness of the implementation, but they can also be > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > +test_integerset.c, the tests will print extra information about execution time > +and memory usage. > > This file is not updated for test_radixtree. I think we can remove it > as the test cases in test_radixtree are clear. Done. I pushed this with a few last-minute cosmetic adjustments. This has been a very long time coming, but we're finally in the home stretch! Already, I see sifaka doesn't like this, and I'm looking now...
On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > + /* Find the control object in shared memory */ > > > + control = handle; > > > > I think it's mostly because of readability; it makes clear that the > > handle should be castable to dsa_pointer and it's a control object. I > > borrowed it from dshash_attach(). > > I find that a bit strange, but I went ahead and kept it. > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > The 0001 patch looks good to me. I have some minor comments: > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > + > > > > "src/backend/lib/radixtree.c" should be updated to > > "src/include/lib/radixtree.h". > > Done. > > > --- /dev/null > > +++ b/src/test/modules/test_radixtree/README > > @@ -0,0 +1,7 @@ > > +test_integerset contains unit tests for testing the integer set implementation > > +in src/backend/lib/integerset.c. > > + > > +The tests verify the correctness of the implementation, but they can also be > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > +test_integerset.c, the tests will print extra information about execution time > > +and memory usage. > > > > This file is not updated for test_radixtree. I think we can remove it > > as the test cases in test_radixtree are clear. > > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! > > Already, I see sifaka doesn't like this, and I'm looking now... It's complaining that these forward declarations... /* generate forward declarations necessary to use the radix tree */ #ifdef RT_DECLARE typedef struct RT_RADIX_TREE RT_RADIX_TREE; typedef struct RT_ITER RT_ITER; ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 feature [-Werror,-Wtypedef-redefinition]" I'll look in the other templates to see if what they do.
On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > + /* Find the control object in shared memory */ > > > > + control = handle; > > > > > > I think it's mostly because of readability; it makes clear that the > > > handle should be castable to dsa_pointer and it's a control object. I > > > borrowed it from dshash_attach(). > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > > + > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > "src/include/lib/radixtree.h". > > > > Done. > > > > > --- /dev/null > > > +++ b/src/test/modules/test_radixtree/README > > > @@ -0,0 +1,7 @@ > > > +test_integerset contains unit tests for testing the integer set implementation > > > +in src/backend/lib/integerset.c. > > > + > > > +The tests verify the correctness of the implementation, but they can also be > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > +test_integerset.c, the tests will print extra information about execution time > > > +and memory usage. > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > as the test cases in test_radixtree are clear. > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > It's complaining that these forward declarations... > > /* generate forward declarations necessary to use the radix tree */ > #ifdef RT_DECLARE > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > typedef struct RT_ITER RT_ITER; > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > feature [-Werror,-Wtypedef-redefinition]" > > I'll look in the other templates to see if what they do. Their "declare" sections have full typedefs. I found it works to leave out the typedef for the "define" section, but I first want to reproduce the build failure. 
In addition, olingo and grassquit are showing different kinds of "AddressSanitizer: odr-violation" errors, which I'm not sure what to make of -- example: ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): [1] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 [2] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 These globals were registered at these points: [1]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 [2]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3 #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5
On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > + /* Find the control object in shared memory */ > > > > > + control = handle; > > > > > > > > I think it's mostly because of readability; it makes clear that the > > > > handle should be castable to dsa_pointer and it's a control object. I > > > > borrowed it from dshash_attach(). > > > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > > > + > > > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > > "src/include/lib/radixtree.h". > > > > > > Done. > > > > > > > --- /dev/null > > > > +++ b/src/test/modules/test_radixtree/README > > > > @@ -0,0 +1,7 @@ > > > > +test_integerset contains unit tests for testing the integer set implementation > > > > +in src/backend/lib/integerset.c. > > > > + > > > > +The tests verify the correctness of the implementation, but they can also be > > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > > +test_integerset.c, the tests will print extra information about execution time > > > > +and memory usage. > > > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > > as the test cases in test_radixtree are clear. > > > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > > has been a very long time coming, but we're finally in the home > > > stretch! > > > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > > > It's complaining that these forward declarations... > > > > /* generate forward declarations necessary to use the radix tree */ > > #ifdef RT_DECLARE > > > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > > typedef struct RT_ITER RT_ITER; > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > feature [-Werror,-Wtypedef-redefinition]" > > > > I'll look in the other templates to see if what they do. > > Their "declare" sections have full typedefs. I found it works to leave > out the typedef for the "define" section, but I first want to > reproduce the build failure. Right. I've reproduced this build failure on my machine by specifying flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the below change seems to fix the problem: --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -676,7 +676,7 @@ typedef struct RT_RADIX_TREE_CONTROL } RT_RADIX_TREE_CONTROL; /* Entry point for allocating and accessing the tree */ -typedef struct RT_RADIX_TREE +struct RT_RADIX_TREE { MemoryContext context; @@ -691,7 +691,7 @@ typedef struct RT_RADIX_TREE /* leaf_context is used only for single-value leaves */ MemoryContextData *leaf_context; #endif -} RT_RADIX_TREE; +}; /* * Iteration support. 
@@ -714,7 +714,7 @@ typedef struct RT_NODE_ITER } RT_NODE_ITER; /* state for iterating over the whole radix tree */ -typedef struct RT_ITER +struct RT_ITER { RT_RADIX_TREE *tree; @@ -728,7 +728,7 @@ typedef struct RT_ITER /* The key constructed during iteration */ uint64 key; -} RT_ITER; +}; /* verification (available only in assert-enabled builds) */ > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: > > ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): > [1] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > [2] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > These globals were registered at these points: > [1]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 > #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 > > [2]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3 > #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5 I'll look at them too. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
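To spell out why dropping the typedef keyword from the definitions fixes this: C99 does not allow a typedef name to be declared twice, even with an identical type -- that is exactly the C11 feature clang is warning about -- whereas a forward typedef followed by a plain struct definition is fine. A minimal illustration with a made-up struct name:

/* forward declaration of the typedef name: declared once, used below */
typedef struct Foo Foo;

/* later definition without repeating the typedef keyword */
struct Foo
{
	int		x;
};

/*
 * Writing "typedef struct Foo { int x; } Foo;" for the definition instead
 * would declare the typedef name a second time, which is what triggers
 * -Wtypedef-redefinition under -std=gnu99.
 */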
On Thu, Mar 7, 2024 at 3:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: odr-violation seems to refer to One Definition Rule (ODR). According to Wikipedia[1]: The One Definition Rule (ODR) is an important rule of the C++ programming language that prescribes that classes/structs and non-inline functions cannot have more than one definition in the entire program and template and types cannot have more than one definition by translation unit. It is defined in the ISO C++ Standard (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages have similar but differently defined rules towards the same objective. I don't fully understand this concept yet but are these two different build failures related? Regards, [1] https://en.wikipedia.org/wiki/One_Definition_Rule -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > I'll look in the other templates to see if what they do. > > > > Their "declare" sections have full typedefs. I found it works to leave > > out the typedef for the "define" section, but I first want to > > reproduce the build failure. > > Right. I've reproduced this build failure on my machine by specifying > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > below change seems to fix the problem: Confirmed, will push shortly.
On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > out the typedef for the "define" section, but I first want to > > > reproduce the build failure. > > > > Right. I've reproduced this build failure on my machine by specifying > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > below change seems to fix the problem: > > Confirmed, will push shortly. mamba complained different build errors[1]: 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); | ~~^ ~~~~~~~~~~~~~~~~~~~ | | | | long int int64 {aka long long int} | %lld ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=] 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]); | ~~^ ~~~~~~~~~~~~~~~~~~~~~~~ | | | | long int int64 {aka long long int} | %lld ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=] 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); | ~~^ ~~~~~~~~~~~~~~~~~~~~~ | | | | long int int64 {aka long long int} | %lld Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > out the typedef for the "define" section, but I first want to > > > > reproduce the build failure. > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > below change seems to fix the problem: > > > > Confirmed, will push shortly. > > mamba complained different build errors[1]: > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > | ~~^ ~~~~~~~~~~~~~~~~~~~ > | | | > | long int int64 {aka long long int} > | %lld > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > expects argument of type 'long int', but argument 4 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > tree->ctl->num_nodes[i]); > | ~~^ > ~~~~~~~~~~~~~~~~~~~~~~~ > | | > | > | long int > int64 {aka long long int} > | %lld > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > expects argument of type 'long int', but argument 3 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > | ~~^ ~~~~~~~~~~~~~~~~~~~~~ > | | | > | long int int64 {aka long long int} > | %lld > > Regards, > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 Yeah, the attached fixes it for me.
Attachment
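Whichever form the attached patch uses, the two usual idioms for printing an int64 portably are the INT64_FORMAT macro and a cast to long long with %lld; a sketch against one of the lines from the error output above:

	/* either form avoids assuming that int64 is plain long */
	fprintf(stderr, "num_keys = " INT64_FORMAT "\n", tree->ctl->num_keys);
	fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys);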
On Thu, Mar 7, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > > out the typedef for the "define" section, but I first want to > > > > > reproduce the build failure. > > > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > > below change seems to fix the problem: > > > > > > Confirmed, will push shortly. > > > > mamba complained different build errors[1]: > > > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > > | ~~^ ~~~~~~~~~~~~~~~~~~~ > > | | | > > | long int int64 {aka long long int} > > | %lld > > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > > expects argument of type 'long int', but argument 4 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > > tree->ctl->num_nodes[i]); > > | ~~^ > > ~~~~~~~~~~~~~~~~~~~~~~~ > > | | > > | > > | long int > > int64 {aka long long int} > > | %lld > > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > > expects argument of type 'long int', but argument 3 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > > | ~~^ ~~~~~~~~~~~~~~~~~~~~~ > > | | | > > | long int int64 {aka long long int} > > | %lld > > > > Regards, > > > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 > > Yeah, the attached fixes it for me. Thanks, LGTM. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > odr-violation seems to refer to One Definition Rule (ODR). According > to Wikipedia[1]: > > The One Definition Rule (ODR) is an important rule of the C++ > programming language that prescribes that classes/structs and > non-inline functions cannot have more than one definition in the > entire program and template and types cannot have more than one > definition by translation unit. It is defined in the ISO C++ Standard > (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages > have similar but differently defined rules towards the same objective. > > I don't fully understand this concept yet but are these two different > build failures related? I thought it may have something to do with the prerequisite commit that moved some symbols from bitmapset.c to .h: /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) #define bmw_popcount(w) pg_popcount32(w) #elif BITS_PER_BITMAPWORD == 64 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) #define bmw_popcount(w) pg_popcount64(w) #else #error "invalid BITS_PER_BITMAPWORD" #endif ...but olingo's error seems strange to me, because it is complaining of pg_leftmost_one_pos, which refers to the lookup table in pg_bitutils.c -- I thought all buildfarm members used the bitscan instructions. grassquit is complaining of pg_popcount64, which is a global function, also in pg_bitutils.c. Not sure what to make of this, since we're just pointing symbols at things which should have a single definition...
On Thu, Mar 7, 2024 at 1:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: This might be relevant: $ git grep 'link_with: pgport_srv' src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, No other test module uses this directive, and indeed, removing this still builds fine for me. Thoughts?
On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 1:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: > > This might be relevant: > > $ git grep 'link_with: pgport_srv' > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > No other test module uses this directive, and indeed, removing this > still builds fine for me. Thoughts? Yeah, it could be the culprit. The test_radixtree/meson.build is the sole extension that explicitly specifies a link with pgport_srv. I think we can get rid of it as I've also confirmed the build is still fine even without it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > $ git grep 'link_with: pgport_srv' > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > No other test module uses this directive, and indeed, removing this > > still builds fine for me. Thoughts? > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > sole extension that explicitly specifies a link with pgport_srv. I > think we can get rid of it as I've also confirmed the build still fine > even without it. olingo and grassquit have turned green, so that must have been it.
On Thu, Mar 7, 2024 at 8:06 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. Cool! I've attached the remaining patches for CI. I've made some minor changes in separate patches and drafted the commit message for the tidstore patch. While reviewing the tidstore code, I thought that it would be more appropriate to place tidstore.c under src/backend/lib instead of src/backend/access/common since the tidstore is no longer implemented only for heap or other access methods, and it might also be used by executor nodes in the future. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 7, 2024 at 8:06 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. fairywren is complaining another build failure: [1931/2156] "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" FAILED: src/test/modules/test_radixtree/test_radixtree.dll "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0): undefined reference to `pg_popcount64' It looks like it requires a link with pgport_srv but I'm not sure. It seems that the recent commit 1f1d73a8b breaks CI, Windows - Server 2019, VS 2019 - Meson & ninja, too. Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > It looks like it requires a link with pgport_srv but I'm not sure. It > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > 2019, VS 2019 - Meson & ninja, too. Unfortunately, none of the Windows animals happened to run in the window after the initial commit and before removing the (seemingly useless on our daily platforms) link. I'll confirm on my own CI branch in a few minutes.
On Fri, Mar 8, 2024 at 10:04 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > It looks like it requires a link with pgport_srv but I'm not sure. It > > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > > 2019, VS 2019 - Meson & ninja, too. > > Unfortunately, none of the Windows animals happened to run both after > the initial commit and before removing the (seemingly useless on our > daily platfoms) link. I'll confirm on my own CI branch in a few > minutes. Yesterday I confirmed that something like the change below fixes the problem that happened in Windows CI: --- a/src/test/modules/test_radixtree/meson.build +++ b/src/test/modules/test_radixtree/meson.build @@ -12,6 +12,7 @@ endif test_radixtree = shared_module('test_radixtree', test_radixtree_sources, + link_with: host_system == 'windows' ? pgport_srv : [], kwargs: pg_test_mod_args, ) test_install_libs += test_radixtree But I'm not sure it's the right fix, especially because I guess it could raise an "AddressSanitizer: odr-violation" error on Windows. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: Glad you shared before I went and did it. > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', > test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], I don't see any similar coding elsewhere, so that leaves me wondering if we're missing something. On the other hand, maybe no test modules use files in src/port ... > kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree > > But I'm not sure it's the right fix especially because I guess it > could raise "AddressSanitizer: odr-violation" error on Windows. Well, it's now at zero definitions that it can see, so I imagine it's possible that adding the above would not cause more than one. In any case, we might not know since as far as I can tell the MSVC animals don't have address sanitizer. I'll look around some more, and if I don't get any revelations, I guess we should go with the above.
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: > > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', > test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], > kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree pgport_srv is for the backend; shared libraries should be using pgport_shlib. Further, the top-level meson.build has: # all shared libraries not part of the backend should depend on this frontend_shlib_code = declare_dependency( include_directories: [postgres_inc], link_with: [common_shlib, pgport_shlib], sources: generated_headers, dependencies: [shlib_code, os_deps, libintl], ) ...but the only things that declare a dependency on frontend_shlib_code are in src/interfaces/. In any case, I'm trying it in a CI branch with pgport_shlib now.
On Fri, Mar 8, 2024 at 9:53 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Yesterday I've confirmed the something like the below fixes the > > problem happened in Windows CI: > > > > --- a/src/test/modules/test_radixtree/meson.build > > +++ b/src/test/modules/test_radixtree/meson.build > > @@ -12,6 +12,7 @@ endif > > > > test_radixtree = shared_module('test_radixtree', > > test_radixtree_sources, > > + link_with: host_system == 'windows' ? pgport_srv : [], > > kwargs: pg_test_mod_args, > > ) > > test_install_libs += test_radixtree > > pgport_srv is for backend, shared libraries should be using pgport_shlib > In any case, I'm trying it in CI branch with pgport_shlib now. That seems to work, so I'll push that just to get things green again.
On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the remaining patches for CI. I've made some minor > changes in separate patches and drafted the commit message for > tidstore patch. > > While reviewing the tidstore code, I thought that it would be more > appropriate to place tidstore.c under src/backend/lib instead of > src/backend/common/access since the tidstore is no longer implemented > only for heap or other access methods, and it might also be used by > executor nodes in the future. What do you think? That's a heck of a good question. I don't think src/backend/lib is right -- it seems that's for general-purpose data structures. Something like backend/utils is also too general. src/backend/access/common has things for tuple descriptors, toast, sessions, and I don't think tidstore is out of place here. I'm not sure there's a better place, but I could be convinced otherwise. v68-0001: I'm not sure if commit messages are much a subject of review, and it's up to the committer, but I'll share a couple comments just as something to think about, not something I would ask you to change: I think it's a bit distracting that the commit message talks about the justification to use it for vacuum. Let's save that for the commit with actual vacuum changes. Also, I suspect saying there are a "wide range" of uses is over-selling it a bit, and that paragraph is a bit awkward aside from that. + /* Collect TIDs extracted from the key-value pair */ + result->num_offsets = 0; + This comment has nothing at all to do with this line. If the comment is for several lines following, some of which are separated by blank lines, there should be a blank line after the comment. Also, why isn't tidstore_iter_extract_tids() responsible for setting that to zero? + ts->context = CurrentMemoryContext; As far as I can tell, this member is never accessed again -- am I missing something? + /* DSA for tidstore will be detached at the end of session */ No other test module pins the mapping, but that doesn't necessarily mean it's wrong. Is there some advantage over explicitly detaching? +-- Add tids in random order. I don't see any randomization here. I do remember adding row_number to remove whitespace in the output, but I don't remember a random order. On that subject, the row_number was an easy trick to avoid extra whitespace, but maybe we should just teach the setting function to return blocknumber rather than null? +Datum +tidstore_create(PG_FUNCTION_ARGS) +{ ... + tidstore = TidStoreCreate(max_bytes, dsa); +Datum +tidstore_set_block_offsets(PG_FUNCTION_ARGS) +{ .... + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); These names are too similar. Maybe the test module should do s/tidstore_/test_/ or similar. +/* Sanity check if we've called tidstore_create() */ +static void +check_tidstore_available(void) +{ + if (tidstore == NULL) + elog(ERROR, "tidstore is not initialized"); +} I don't find this very helpful. If a developer wiped out the create call, wouldn't the test crash and burn pretty obviously? In general, the .sql file is still very hard-coded. Functions are created that contain a VALUES statement. Maybe it's okay for now, but wanted to mention it. Ideally, we'd have some randomized tests, without having to display it. That could be in addition to (not replacing) the small tests we have that display input. 
(see below) v68-0002: @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; +#ifdef TIDSTORE_DEBUG + if (!TidStoreIsShared(ts)) + { + bool ret_debug = ts_debug_is_member(ts, tid);; + Assert(ret == ret_debug); + } +#endif This is only checking the case where we haven't returned already. In particular... + /* no entry for the blk */ + if (page == NULL) + return false; + + wordnum = WORDNUM(off); + bitnum = BITNUM(off); + + /* no bitmap for the off */ + if (wordnum >= page->nwords) + return false; ...these results are not checked. More broadly, it seems like the test module should be able to test everything that the debug-build array would complain about. Including ordered iteration. This may require first saving our test input to a table. We could create a cursor on a query that fetches the ordered input from the table and verifies that the tid store iteration produces the same ordered set, maybe with pl/pgSQL. Or something like that. Seems like not a whole lot of work. I can try later in the week, if you like. v68-0005/6 look ready to squash v68-0008 - I'm not a fan of capitalizing short comment fragments. I use the style of either: short lower-case phrases, or full sentences including capitalization, correct grammar and period. I see these two styles all over the code base, as appropriate. + /* Remain attached until end of backend */ We'll probably want this comment, if in fact we want this behavior. + /* + * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT + * before return. + */ I'm not sure what this is trying to say or why it's relevant, since it's been a while since I've written a SRF in C. That's all I have for now, and I haven't looked at the vacuum changes this time.
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 8:26 PM John Naylor <johncnaylorls@gmail.com> wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" iis called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): [found a big speed-up where we don't expect one] I tried to reproduce this (similar patch, but rebased on top of a bug you recently fixed (possibly related?) -- attached, and also shows one way to address some lack of coverage in the debug build, for as long as we test that with CI). Fortunately I cannot see a difference, so I believe it's not affecting the case in this test all, as expected: v68: INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 770, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s v68 + emb values (that cannot be used because > 3 tids per block): INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 775, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s system usage: CPU: user: 3.05 s, system: 0.41 s, elapsed: 3.47 s system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s I'll continue polishing the runtime-embeddable values patch as time permits, for later consideration.
Attachment
On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've attached the remaining patches for CI. I've made some minor > > changes in separate patches and drafted the commit message for > > tidstore patch. > > > > While reviewing the tidstore code, I thought that it would be more > > appropriate to place tidstore.c under src/backend/lib instead of > > src/backend/common/access since the tidstore is no longer implemented > > only for heap or other access methods, and it might also be used by > > executor nodes in the future. What do you think? > > That's a heck of a good question. I don't think src/backend/lib is > right -- it seems that's for general-purpose data structures. > Something like backend/utils is also too general. > src/backend/access/common has things for tuple descriptors, toast, > sessions, and I don't think tidstore is out of place here. I'm not > sure there's a better place, but I could be convinced otherwise. Yeah, I agreed that src/backend/lib seems not to be the place for tidstore. Let's keep it in src/backend/access/common. If others think differently, we can move it later. > > v68-0001: > > I'm not sure if commit messages are much a subject of review, and it's > up to the committer, but I'll share a couple comments just as > something to think about, not something I would ask you to change: I > think it's a bit distracting that the commit message talks about the > justification to use it for vacuum. Let's save that for the commit > with actual vacuum changes. Also, I suspect saying there are a "wide > range" of uses is over-selling it a bit, and that paragraph is a bit > awkward aside from that. Thank you for the comment, and I agreed. I've updated the commit message. > > + /* Collect TIDs extracted from the key-value pair */ > + result->num_offsets = 0; > + > > This comment has nothing at all to do with this line. If the comment > is for several lines following, some of which are separated by blank > lines, there should be a blank line after the comment. Also, why isn't > tidstore_iter_extract_tids() responsible for setting that to zero? Agreed, fixed. I also updated this part so we set result->blkno in tidstore_iter_extract_tids() too, which seems more readable. > > + ts->context = CurrentMemoryContext; > > As far as I can tell, this member is never accessed again -- am I > missing something? You're right. It was used to re-create the tidstore in the same context again while resetting it, but we no longer support the reset API. Considering it again, would it be better to allocate the iterator struct in the same context as we store the tidstore struct? > > + /* DSA for tidstore will be detached at the end of session */ > > No other test module pins the mapping, but that doesn't necessarily > mean it's wrong. Is there some advantage over explicitly detaching? One small benefit of not explicitly detaching dsa_area in tidstore_destroy() would be simplicity; IIUC if we want to do that, we need to remember the dsa_area using (for example) a static variable, and free it if it's non-NULL. I've implemented this idea in the attached patch. > > +-- Add tids in random order. > > I don't see any randomization here. I do remember adding row_number to > remove whitespace in the output, but I don't remember a random order. 
> On that subject, the row_number was an easy trick to avoid extra > whitespace, but maybe we should just teach the setting function to > return blocknumber rather than null? Good idea, fixed. > > +Datum > +tidstore_create(PG_FUNCTION_ARGS) > +{ > ... > + tidstore = TidStoreCreate(max_bytes, dsa); > > +Datum > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > +{ > .... > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > These names are too similar. Maybe the test module should do > s/tidstore_/test_/ or similar. Agreed. > > +/* Sanity check if we've called tidstore_create() */ > +static void > +check_tidstore_available(void) > +{ > + if (tidstore == NULL) > + elog(ERROR, "tidstore is not initialized"); > +} > > I don't find this very helpful. If a developer wiped out the create > call, wouldn't the test crash and burn pretty obviously? Removed. > > In general, the .sql file is still very hard-coded. Functions are > created that contain a VALUES statement. Maybe it's okay for now, but > wanted to mention it. Ideally, we'd have some randomized tests, > without having to display it. That could be in addition to (not > replacing) the small tests we have that display input. (see below) > Agreed to add randomized tests in addition to the existing tests. > > v68-0002: > > @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) > > ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; > > +#ifdef TIDSTORE_DEBUG > + if (!TidStoreIsShared(ts)) > + { > + bool ret_debug = ts_debug_is_member(ts, tid);; > + Assert(ret == ret_debug); > + } > +#endif > > This only checking the case where we haven't returned already. In particular... > > + /* no entry for the blk */ > + if (page == NULL) > + return false; > + > + wordnum = WORDNUM(off); > + bitnum = BITNUM(off); > + > + /* no bitmap for the off */ > + if (wordnum >= page->nwords) > + return false; > > ...these results are not checked. > > More broadly, it seems like the test module should be able to test > everything that the debug-build array would complain about. Including > ordered iteration. This may require first saving our test input to a > table. We could create a cursor on a query that fetches the ordered > input from the table and verifies that the tid store iterate produces > the same ordered set, maybe with pl/pgSQL. Or something like that. > Seems like not a whole lot of work. I can try later in the week, if > you like. Sounds a good idea. In fact, if there are some bugs in tidstore, it's likely that even initdb would fail in practice. However, it's a very good idea that we can test the tidstore anyway with such a check without a debug-build array. Or as another idea, I wonder if we could keep the debug-build array in some form. For example, we use the array with the particular build flag and set a BF animal for that. That way, we can test the tidstore in more real cases. > > v68-0005/6 look ready to squash Done. > > v68-0008 - I'm not a fan of captilizing short comment fragments. I use > the style of either: short lower-case phrases, or full sentences > including capitalization, correct grammar and period. I see these two > styles all over the code base, as appropriate. Agreed. > > + /* Remain attached until end of backend */ > > We'll probably want this comment, if in fact we want this behavior. Kept it. > > + /* > + * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT > + * before return. 
> + */ > > I'm not sure what this is trying to say or why it's relevant, since > it's been a while since I've written a SRF in C. What I wanted to say is that we cannot do something like: SRF_RETURN_NEXT(funcctx, PointerGetDatum(&(tids[funcctx->call_cntr]))); because funcctx->call_cntr is incremented *before* return and therefore we will end up accessing the index out of range. It took me some time to realize this. > That's all I have for now, and I haven't looked at the vacuum changes this time. Thank you for the comments! In the latest (v69) patch: - squashed v68-0005 and v68-0006 patches. - removed most of the changes in v68-0007 patch. - addressed above review comments in v69-0002 patch. - v69-0003, 0004, and 0005 are miscellaneous updates. As for renaming TidStore to TIDStore, I dropped the patch for now since it seems we're using "Tid" in some function names and variable names. If we want to update it, we can do that later. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
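To make the SRF_RETURN_NEXT point above concrete, here is a minimal sketch -- a hypothetical function, not the test module's actual code -- of the pattern that works: build the result datum first, because the macro increments funcctx->call_cntr before its argument is evaluated and returned.

#include "postgres.h"

#include "fmgr.h"
#include "funcapi.h"
#include "storage/itemptr.h"

PG_FUNCTION_INFO_V1(test_dump_tids_sketch);

Datum
test_dump_tids_sketch(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;
	ItemPointer tids;

	if (SRF_IS_FIRSTCALL())
	{
		funcctx = SRF_FIRSTCALL_INIT();

		/* assume something earlier stashed an array of TIDs and a count */
		funcctx->user_fctx = NULL;
		funcctx->max_calls = 0;
	}

	funcctx = SRF_PERCALL_SETUP();
	tids = (ItemPointer) funcctx->user_fctx;

	if (funcctx->call_cntr < funcctx->max_calls)
	{
		/* read the element with the pre-increment counter ... */
		Datum		result = PointerGetDatum(&tids[funcctx->call_cntr]);

		/* ... then let the macro bump call_cntr and return */
		SRF_RETURN_NEXT(funcctx, result);
	}

	SRF_RETURN_DONE(funcctx);
}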
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. Since v69 conflicts with the current HEAD, I've rebased the patches. In addition, v70-0008 is a new patch that cleans up the vacuum integration patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > + ts->context = CurrentMemoryContext; > > > > As far as I can tell, this member is never accessed again -- am I > > missing something? > > You're right. It was used to re-create the tidstore in the same > context again while resetting it, but we no longer support the reset > API. Considering it again, would it be better to allocate the iterator > struct in the same context as we store the tidstore struct? That makes sense. > > + /* DSA for tidstore will be detached at the end of session */ > > > > No other test module pins the mapping, but that doesn't necessarily > > mean it's wrong. Is there some advantage over explicitly detaching? > > One small benefit of not explicitly detaching dsa_area in > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > need to remember the dsa_area using (for example) a static variable, > and free it if it's non-NULL. I've implemented this idea in the > attached patch. Okay, I don't have a strong preference at this point. > > +-- Add tids in random order. > > > > I don't see any randomization here. I do remember adding row_number to > > remove whitespace in the output, but I don't remember a random order. > > On that subject, the row_number was an easy trick to avoid extra > > whitespace, but maybe we should just teach the setting function to > > return blocknumber rather than null? > > Good idea, fixed. + test_set_block_offsets +------------------------ + 2147483647 + 0 + 4294967294 + 1 + 4294967295 Hmm, was the earlier comment about randomness referring to this? I'm not sure what other regression tests do in these cases, or how relibale this is. If this is a problem we could simply insert this result into a temp table so it's not output. > > +Datum > > +tidstore_create(PG_FUNCTION_ARGS) > > +{ > > ... > > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > +Datum > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > +{ > > .... > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > These names are too similar. Maybe the test module should do > > s/tidstore_/test_/ or similar. > > Agreed. Mostly okay, although a couple look a bit generic now. I'll leave it up to you if you want to tweak things. > > In general, the .sql file is still very hard-coded. Functions are > > created that contain a VALUES statement. Maybe it's okay for now, but > > wanted to mention it. Ideally, we'd have some randomized tests, > > without having to display it. That could be in addition to (not > > replacing) the small tests we have that display input. (see below) > > > > Agreed to add randomized tests in addition to the existing tests. I'll try something tomorrow. > Sounds a good idea. In fact, if there are some bugs in tidstore, it's > likely that even initdb would fail in practice. However, it's a very > good idea that we can test the tidstore anyway with such a check > without a debug-build array. > > Or as another idea, I wonder if we could keep the debug-build array in > some form. For example, we use the array with the particular build > flag and set a BF animal for that. That way, we can test the tidstore > in more real cases. I think the purpose of a debug flag is to help developers catch mistakes. I don't think it's quite useful enough for that. 
For one, it has the same 1GB limitation as vacuum's current array. For another, it'd be a terrible way to debug moving tidbitmap.c from its hash table to use TID store -- AND/OR operations and lossy pages are pretty much undoable with a copy of vacuum's array. Last year, when I insisted on trying a long-term realistic load that compares the result with the array, the encoding scheme was much harder to understand in code. I think it's now easier, and there are better tests. > In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. > > As for renaming TidStore to TIDStore, I dropped the patch for now > since it seems we're using "Tid" in some function names and variable > names. If we want to update it, we can do that later. I think we're not consistent across the codebase, and it's fine to drop that patch. v70-0008: @@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs) /* * Free the current tidstore and return allocated DSA segments to the * operating system. Then we recreate the tidstore with the same max_bytes - * limitation. + * limitation we just used. Nowadays, max_bytes is more like a hint for tidstore, and not a limitation, right? Vacuum has the limitation. Maybe instead of "with", we should say "passing the same limitation". I wonder how "di_info" would look as "dead_items_info". I don't feel too strongly about it, though. I'm going to try additional regression tests, as mentioned, and try a couple benchmarks. It should be only a couple more days. One thing that occurred to me: The radix tree regression tests only compile and run the local memory case. The tidstore commit would be the first time the buildfarm has seen the shared memory case, so we should look out for possible build failures of the same sort we saw with the radix tree tests. I see you've already removed the problematic link_with command -- that's the kind of thing to double-check for.
On Tue, Mar 12, 2024 at 7:34 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > + ts->context = CurrentMemoryContext; > > > > > > As far as I can tell, this member is never accessed again -- am I > > > missing something? > > > > You're right. It was used to re-create the tidstore in the same > > context again while resetting it, but we no longer support the reset > > API. Considering it again, would it be better to allocate the iterator > > struct in the same context as we store the tidstore struct? > > That makes sense. > > > > + /* DSA for tidstore will be detached at the end of session */ > > > > > > No other test module pins the mapping, but that doesn't necessarily > > > mean it's wrong. Is there some advantage over explicitly detaching? > > > > One small benefit of not explicitly detaching dsa_area in > > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > > need to remember the dsa_area using (for example) a static variable, > > and free it if it's non-NULL. I've implemented this idea in the > > attached patch. > > Okay, I don't have a strong preference at this point. I'd keep the update on that. > > > > +-- Add tids in random order. > > > > > > I don't see any randomization here. I do remember adding row_number to > > > remove whitespace in the output, but I don't remember a random order. > > > On that subject, the row_number was an easy trick to avoid extra > > > whitespace, but maybe we should just teach the setting function to > > > return blocknumber rather than null? > > > > Good idea, fixed. > > + test_set_block_offsets > +------------------------ > + 2147483647 > + 0 > + 4294967294 > + 1 > + 4294967295 > > Hmm, was the earlier comment about randomness referring to this? I'm > not sure what other regression tests do in these cases, or how > relibale this is. If this is a problem we could simply insert this > result into a temp table so it's not output. I didn't address the comment about randomness. I think that we will have both random TIDs tests and fixed TIDs tests in test_tidstore as we discussed, and probably we can do both tests with similar steps; insert TIDs into both a temp table and tidstore and check if the tidstore returned the results as expected by comparing the results to the temp table. Probably we can have a common pl/pgsql function that checks that and raises a WARNING or an ERROR. Given that this is very similar to what we did in test_radixtree, why do we really want to implement it using a pl/pgsql function? When we discussed it before, I found the current way makes sense. But given that we're adding more tests and will add more tests in the future, doing the tests in C will be more maintainable and faster. Also, I think we can do the debug-build array stuff in the test_tidstore code instead. > > > > +Datum > > > +tidstore_create(PG_FUNCTION_ARGS) > > > +{ > > > ... > > > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > > > +Datum > > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > > +{ > > > .... > > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > > > These names are too similar. Maybe the test module should do > > > s/tidstore_/test_/ or similar. > > > > Agreed. > > Mostly okay, although a couple look a bit generic now. 
I'll leave it > up to you if you want to tweak things. > > > > In general, the .sql file is still very hard-coded. Functions are > > > created that contain a VALUES statement. Maybe it's okay for now, but > > > wanted to mention it. Ideally, we'd have some randomized tests, > > > without having to display it. That could be in addition to (not > > > replacing) the small tests we have that display input. (see below) > > > > > > > Agreed to add randomized tests in addition to the existing tests. > > I'll try something tomorrow. > > > Sounds a good idea. In fact, if there are some bugs in tidstore, it's > > likely that even initdb would fail in practice. However, it's a very > > good idea that we can test the tidstore anyway with such a check > > without a debug-build array. > > > > Or as another idea, I wonder if we could keep the debug-build array in > > some form. For example, we use the array with the particular build > > flag and set a BF animal for that. That way, we can test the tidstore > > in more real cases. > > I think the purpose of a debug flag is to help developers catch > mistakes. I don't think it's quite useful enough for that. For one, it > has the same 1GB limitation as vacuum's current array. For another, > it'd be a terrible way to debug moving tidbitmap.c from its hash table > to use TID store -- AND/OR operations and lossy pages are pretty much > undoable with a copy of vacuum's array. Valid points. As I mentioned above, if we implement the test cases in C, we can use the debug-build array in the test code. And we won't use it in AND/OR operations tests in the future. > > > In the latest (v69) patch: > > > > - squashed v68-0005 and v68-0006 patches. > > - removed most of the changes in v68-0007 patch. > > - addressed above review comments in v69-0002 patch. > > - v69-0003, 0004, and 0005 are miscellaneous updates. > > > > As for renaming TidStore to TIDStore, I dropped the patch for now > > since it seems we're using "Tid" in some function names and variable > > names. If we want to update it, we can do that later. > > I think we're not consistent across the codebase, and it's fine to > drop that patch. > > v70-0008: > > @@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs) > /* > * Free the current tidstore and return allocated DSA segments to the > * operating system. Then we recreate the tidstore with the same max_bytes > - * limitation. > + * limitation we just used. > > Nowadays, max_bytes is now more like a hint for tidstore, and not a > limitation, right? Vacuum has the limitation. Right. > Maybe instead of "with", > we should say "passing the same limitation". Will fix. > > I wonder how "di_info" would look as "dead_items_info". I don't feel > too strongly about it, though. Agreed. > > I'm going to try additional regression tests, as mentioned, and try a > couple benchmarks. It should be only a couple more days. Thank you! > One thing that occurred to me: The radix tree regression tests only > compile and run the local memory case. The tidstore commit would be > the first time the buildfarm has seen the shared memory case, so we > should look out for possible build failures of the same sort we saw > with the the radix tree tests. I see you've already removed the > problematic link_with command -- that's the kind of thing to > double-check for. Good point, agreed. I'll double-check it again. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > As I mentioned above, if we implement the test cases in C, we can use > the debug-build array in the test code. And we won't use it in AND/OR > operations tests in the future. That's a really interesting idea, so I went ahead and tried that for v71. This seems like a good basis for testing larger, randomized inputs, once we decide how best to hide that from the expected output. The tests use SQL functions do_set_block_offsets() and check_set_block_offsets(). The latter does two checks against a tid array, and replaces test_dump_tids(). Funnily enough, the debug array itself gave false failures when using a similar array in the test harness, because it didn't know all the places where the array should have been sorted -- it only worked by chance before because of what order things were done. I squashed everything from v70 and also took the liberty of switching on shared memory for tid store tests. The only reason we didn't do this with the radix tree tests is that the static attach/detach functions would raise warnings since they are not used.
Attachment
- v71-0002-DEV-Debug-TidStore.patch
- v71-0003-DEV-Fix-failure-to-sort-debug-array-for-iteratio.patch
- v71-0005-Use-shared-memory-in-TID-store-tests.patch
- v71-0006-Use-TidStore-to-store-dead-tuple-TIDs-during-laz.patch
- v71-0004-WIP-Use-array-of-TIDs-in-TID-store-regression-te.patch
- v71-0001-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
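To make the test-harness approach described above concrete, here is a minimal sketch of how a SQL-callable setter can record what it inserted for later verification. It is only a sketch: the module-level "tidstore" pointer, the bigint block argument, the int2[] offsets argument, and the header names are assumptions based on the descriptions in this thread, not code from the v71 patches.

#include "postgres.h"
#include "fmgr.h"
#include "access/tidstore.h"		/* header name assumed from the patch series */
#include "storage/itemptr.h"
#include "utils/array.h"
#include "utils/memutils.h"

PG_MODULE_MAGIC;					/* once per module */

static TidStore *tidstore;			/* created elsewhere by the module */

/* TIDs recorded for later verification, in insert order */
static ItemPointerData *inserted_tids = NULL;
static int	num_inserted = 0;
static int	max_inserted = 0;

PG_FUNCTION_INFO_V1(do_set_block_offsets);
Datum
do_set_block_offsets(PG_FUNCTION_ARGS)
{
	BlockNumber blkno = (BlockNumber) PG_GETARG_INT64(0);
	ArrayType  *ta = PG_GETARG_ARRAYTYPE_P(1);	/* int2[], 1-D, no NULLs */
	OffsetNumber *offs = (OffsetNumber *) ARR_DATA_PTR(ta);
	int			noffs = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));

	/* hand the block over to the store under test */
	TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

	/* ... and remember the same TIDs in a flat array for verification */
	for (int i = 0; i < noffs; i++)
	{
		if (num_inserted >= max_inserted)
		{
			/* long-lived context, so the array survives across statements */
			max_inserted = Max(64, max_inserted * 2);
			inserted_tids = (inserted_tids == NULL)
				? MemoryContextAlloc(TopMemoryContext,
									 sizeof(ItemPointerData) * max_inserted)
				: repalloc(inserted_tids, sizeof(ItemPointerData) * max_inserted);
		}
		ItemPointerSet(&inserted_tids[num_inserted++], blkno, offs[i]);
	}

	PG_RETURN_INT64(blkno);
}

A matching CREATE FUNCTION declaring do_set_block_offsets(bigint, int2[]) RETURNS bigint is what lets the .sql file drive the module with generate_series() and array literals.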
On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > As I mentioned above, if we implement the test cases in C, we can use > > the debug-build array in the test code. And we won't use it in AND/OR > > operations tests in the future. > > That's a really interesting idea, so I went ahead and tried that for > v71. This seems like a good basis for testing larger, randomized > inputs, once we decide how best to hide that from the expected output. > The tests use SQL functions do_set_block_offsets() and > check_set_block_offsets(). The latter does two checks against a tid > array, and replaces test_dump_tids(). Great! I think that's a very good starter. The lookup_test() (and test_lookup_tids()) do also test that the IsMember() function returns false as expected if the TID doesn't exist in it, and probably we can do these tests in a C function too. BTW do we still want to test the tidstore by using a combination of SQL functions? We might no longer need to input TIDs via a SQL function. > Funnily enough, the debug array > itself gave false failures when using a similar array in the test > harness, because it didn't know all the places where the array should > have been sorted -- it only worked by chance before because of what > order things were done. Good catch, thanks. > I squashed everything from v70 and also took the liberty of switching > on shared memory for tid store tests. The only reason we didn't do > this with the radix tree tests is that the static attach/detach > functions would raise warnings since they are not used. Agreed to test the tidstore on shared memory. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > the debug-build array in the test code. And we won't use it in AND/OR > > > operations tests in the future. > > > > That's a really interesting idea, so I went ahead and tried that for > > v71. This seems like a good basis for testing larger, randomized > > inputs, once we decide how best to hide that from the expected output. > > The tests use SQL functions do_set_block_offsets() and > > check_set_block_offsets(). The latter does two checks against a tid > > array, and replaces test_dump_tids(). > > Great! I think that's a very good starter. > > The lookup_test() (and test_lookup_tids()) do also test that the > IsMember() function returns false as expected if the TID doesn't exist > in it, and probably we can do these tests in a C function too. > > BTW do we still want to test the tidstore by using a combination of > SQL functions? We might no longer need to input TIDs via a SQL > function. I'm not sure. I stopped short of doing that to get feedback on this much. One advantage with SQL functions is we can use generate_series to easily input lists of blocks with different numbers and strides, and array literals for offsets are a bit easier. What do you think?
On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > > the debug-build array in the test code. And we won't use it in AND/OR > > > > operations tests in the future. > > > > > > That's a really interesting idea, so I went ahead and tried that for > > > v71. This seems like a good basis for testing larger, randomized > > > inputs, once we decide how best to hide that from the expected output. > > > The tests use SQL functions do_set_block_offsets() and > > > check_set_block_offsets(). The latter does two checks against a tid > > > array, and replaces test_dump_tids(). > > > > Great! I think that's a very good starter. > > > > The lookup_test() (and test_lookup_tids()) do also test that the > > IsMember() function returns false as expected if the TID doesn't exist > > in it, and probably we can do these tests in a C function too. > > > > BTW do we still want to test the tidstore by using a combination of > > SQL functions? We might no longer need to input TIDs via a SQL > > function. > > I'm not sure. I stopped short of doing that to get feedback on this > much. One advantage with SQL functions is we can use generate_series > to easily input lists of blocks with different numbers and strides, > and array literals for offsets are a bit easier. What do you think? While I'm not a fan of the following part, I agree that it makes sense to use SQL functions for test data generation: -- Constant values used in the tests. \set maxblkno 4294967295 -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291. -- We use a higher number to test tidstore. \set maxoffset 512 It would also be easier for developers to test the tidstore with their own data set. So I agreed with the current approach; use SQL functions for data generation and do the actual tests inside C functions. Is it convenient for developers if we have functions like generate_tids() and generate_random_tids() to generate TIDs so that they can pass them to do_set_block_offsets()? Then they call check_set_block_offsets() and others for actual data lookup and iteration tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > BTW do we still want to test the tidstore by using a combination of > > > SQL functions? We might no longer need to input TIDs via a SQL > > > function. > > > > I'm not sure. I stopped short of doing that to get feedback on this > > much. One advantage with SQL functions is we can use generate_series > > to easily input lists of blocks with different numbers and strides, > > and array literals for offsets are a bit easier. What do you think? > > While I'm not a fan of the following part, I agree that it makes sense > to use SQL functions for test data generation: > > -- Constant values used in the tests. > \set maxblkno 4294967295 > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291. > -- We use a higher number to test tidstore. > \set maxoffset 512 I'm not really a fan of these either, and they could be removed at some point if we've done everything else nicely. > It would also be easier for developers to test the tidstore with their > own data set. So I agreed with the current approach; use SQL functions > for data generation and do the actual tests inside C functions. Okay, here's another idea: Change test_lookup_tids() to be more general and put the validation down into C as well. First we save the blocks from do_set_block_offsets() into a table, then with all those blocks lookup a sufficiently-large range of possible offsets and save found values in another array. So the static items structure would have 3 arrays: inserts, successful lookups, and iteration (currently the iteration output is private to check_set_block_offsets()). Then sort as needed and check they are all the same. Further thought: We may not really need to test block numbers that vigorously, since the radix tree tests should cover keys/values pretty well. The difference here is using bitmaps of tids and that should be well covered. Locally (not CI), we should try big inputs to make sure we can actually go up to many GB -- it's easier and faster this way than having vacuum give us a large data set. > Is it > convenient for developers if we have functions like generate_tids() > and generate_random_tids() to generate TIDs so that they can pass them > to do_set_block_offsets()? I guess I don't see the advantage of adding a layer of indirection at this point, but it could be useful at a later time.
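For the "sort as needed and check they are all the same" step just described, a sketch of the comparison between any two of those arrays might look like this (names invented here; itemptr_cmp is the comparator from the earlier sketch):

static void
verify_arrays_match(ItemPointerData *a, int na,
					ItemPointerData *b, int nb,
					const char *what)
{
	qsort(a, na, sizeof(ItemPointerData), itemptr_cmp);
	qsort(b, nb, sizeof(ItemPointerData), itemptr_cmp);

	if (na != nb)
		elog(ERROR, "%s: expected %d TIDs, got %d", what, na, nb);

	for (int i = 0; i < na; i++)
	{
		if (!ItemPointerEquals(&a[i], &b[i]))
			elog(ERROR, "%s: TID mismatch at position %d", what, i);
	}
}

check_set_block_offsets() would then call this once for the lookup array and once for the iteration array, both against the insert array.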
On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > BTW do we still want to test the tidstore by using a combination of > > > > SQL functions? We might no longer need to input TIDs via a SQL > > > > function. > > > > > > I'm not sure. I stopped short of doing that to get feedback on this > > > much. One advantage with SQL functions is we can use generate_series > > > to easily input lists of blocks with different numbers and strides, > > > and array literals for offsets are a bit easier. What do you think? > > > > While I'm not a fan of the following part, I agree that it makes sense > > to use SQL functions for test data generation: > > > > -- Constant values used in the tests. > > \set maxblkno 4294967295 > > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291. > > -- We use a higher number to test tidstore. > > \set maxoffset 512 > > I'm not really a fan of these either, and could be removed a some > point if we've done everything else nicely. > > > It would also be easier for developers to test the tidstore with their > > own data set. So I agreed with the current approach; use SQL functions > > for data generation and do the actual tests inside C functions. > > Okay, here's an another idea: Change test_lookup_tids() to be more > general and put the validation down into C as well. First we save the > blocks from do_set_block_offsets() into a table, then with all those > blocks lookup a sufficiently-large range of possible offsets and save > found values in another array. So the static items structure would > have 3 arrays: inserts, successful lookups, and iteration (currently > the iteration output is private to check_set_block_offsets(). Then > sort as needed and check they are all the same. That's a promising idea. We can use the same mechanism for randomized tests too. If you're going to work on this, I'll do other tests on my environment in the meantime. > > Further thought: We may not really need to test block numbers that > vigorously, since the radix tree tests should cover keys/values pretty > well. Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and MaxBlockNumber, would be sufficient. > The difference here is using bitmaps of tids and that should be > well covered. Right. We would need to test offset numbers vigorously instead. > > Locally (not CI), we should try big inputs to make sure we can > actually go up to many GB -- it's easier and faster this way than > having vacuum give us a large data set. I'll do these tests. > > > Is it > > convenient for developers if we have functions like generate_tids() > > and generate_random_tids() to generate TIDs so that they can pass them > > to do_set_block_offsets()? > > I guess I don't see the advantage of adding a layer of indirection at > this point, but it could be useful at a later time. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Okay, here's an another idea: Change test_lookup_tids() to be more > > general and put the validation down into C as well. First we save the > > blocks from do_set_block_offsets() into a table, then with all those > > blocks lookup a sufficiently-large range of possible offsets and save > > found values in another array. So the static items structure would > > have 3 arrays: inserts, successful lookups, and iteration (currently > > the iteration output is private to check_set_block_offsets(). Then > > sort as needed and check they are all the same. > > That's a promising idea. We can use the same mechanism for randomized > tests too. If you're going to work on this, I'll do other tests on my > environment in the meantime. Some progress on this in v72 -- I tried first without using SQL to save the blocks, just using the unique blocks from the verification array. It seems to work fine. Some open questions on the test module: - Since there are now three arrays we should reduce max bytes to something smaller. - Further on that, I'm not sure if the "is full" test is telling us much. It seems we could make max bytes a static variable and set it to the size of the empty store. I'm guessing it wouldn't take much to add enough tids so that the contexts need to allocate some blocks, and then it would appear full and we can test that. I've made it so all arrays repalloc when needed, just in case. - Why are we switching to TopMemoryContext? It's not explained -- the comment only tells what the code is doing (which is obvious), but not why. - I'm not sure it's useful to keep test_lookup_tids() around. Since we now have a separate lookup test, the only thing it can tell us is that lookups fail on an empty store. I arranged it so that check_set_block_offsets() works on an empty store. Although that's even more trivial, it's just reusing what we already need.
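The "is full" idea above might boil down to something like the following. TidStoreMemoryUsage() is an assumed name for the store's memory-accounting function, and the comparison against the empty size is a deliberate simplification of the max-bytes threshold; none of this is taken from any patch version.

static size_t empty_size;		/* memory usage right after creation */

static void
remember_empty_size(TidStore *ts)
{
	empty_size = TidStoreMemoryUsage(ts);	/* assumed accessor name */
}

static bool
store_is_full(TidStore *ts)
{
	/* "full" here just means usage has grown past the initial allocation */
	return TidStoreMemoryUsage(ts) > empty_size;
}

/* Usage: keep adding one-offset blocks until the store reports "full" */
static void
fill_until_full(TidStore *ts)
{
	for (BlockNumber blkno = 0; !store_is_full(ts); blkno++)
	{
		OffsetNumber off = FirstOffsetNumber;

		TidStoreSetBlockOffsets(ts, blkno, &off, 1);
	}
}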
On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > general and put the validation down into C as well. First we save the > > > blocks from do_set_block_offsets() into a table, then with all those > > > blocks lookup a sufficiently-large range of possible offsets and save > > > found values in another array. So the static items structure would > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > the iteration output is private to check_set_block_offsets(). Then > > > sort as needed and check they are all the same. > > > > That's a promising idea. We can use the same mechanism for randomized > > tests too. If you're going to work on this, I'll do other tests on my > > environment in the meantime. > > Some progress on this in v72 -- I tried first without using SQL to > save the blocks, just using the unique blocks from the verification > array. It seems to work fine. Thanks! > > - Since there are now three arrays we should reduce max bytes to > something smaller. Agreed. > - Further on that, I'm not sure if the "is full" test is telling us > much. It seems we could make max bytes a static variable and set it to > the size of the empty store. I'm guessing it wouldn't take much to add > enough tids so that the contexts need to allocate some blocks, and > then it would appear full and we can test that. I've made it so all > arrays repalloc when needed, just in case. How about using work_mem as max_bytes instead of having it as a static variable? In test_tidstore.sql we set work_mem before creating the tidstore. It would make the tidstore more controllable by SQL queries. > - Why are we switching to TopMemoryContext? It's not explained -- the > comment only tells what the code is doing (which is obvious), but not > why. This is because the tidstore needs to live across the transaction boundary. We can use TopMemoryContext or CacheMemoryContext. > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > now have a separate lookup test, the only thing it can tell us is that > lookups fail on an empty store. I arranged it so that > check_set_block_offsets() works on an empty store. Although that's > even more trivial, it's just reusing what we already need. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
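The cross-transaction lifetime mentioned here is the kind of thing the comment could spell out. A minimal sketch of the creation path follows, using the two-argument TidStoreCreate() signature quoted earlier in the thread, with NULL standing in for "no DSA" (an assumption); tidstore is the module-level pointer from the earlier sketch.

PG_FUNCTION_INFO_V1(test_create);
Datum
test_create(PG_FUNCTION_ARGS)
{
	size_t		max_bytes = (size_t) PG_GETARG_INT64(0);
	MemoryContext oldcxt;

	/*
	 * Each test function runs in its own transaction, so the store (and the
	 * iterator, per the discussion above) must live in a long-lived context
	 * such as TopMemoryContext or CacheMemoryContext.
	 */
	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
	tidstore = TidStoreCreate(max_bytes, NULL); /* local-memory case */
	MemoryContextSwitchTo(oldcxt);

	PG_RETURN_VOID();
}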
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! > > > > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. > > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. > > > - Why are we switching to TopMemoryContext? It's not explained -- the > > comment only tells what the code is doing (which is obvious), but not > > why. > > This is because the tidstore needs to live across the transaction > boundary. We can use TopMemoryContext or CacheMemoryContext. > > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. > I have two questions on tidstore.c: +/* + * Set the given TIDs on the blkno to TidStore. + * + * NB: the offset numbers in offsets must be sorted in ascending order. + */ Do we need some assertions to check if the given offset numbers are sorted expectedly? --- + if (TidStoreIsShared(ts)) + found = shared_rt_set(ts->tree.shared, blkno, page); + else + found = local_rt_set(ts->tree.local, blkno, page); + + Assert(!found); Given TidStoreSetBlockOffsets() is designed to always set (i.e. overwrite) the value, I think we should not expect that found is always false. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
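On the first question, such an assertion could be as small as the following loop at the top of TidStoreSetBlockOffsets() (num_offsets standing in for whatever the count parameter is called; placement is a guess):

#ifdef USE_ASSERT_CHECKING
	/* offsets are required to be unique and in ascending order */
	for (int i = 1; i < num_offsets; i++)
		Assert(offsets[i] > offsets[i - 1]);
#endif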
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! Seems I forgot the attachment last time...there's more stuff now anyway, based on discussion. > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. I went further than this, see below. > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. My complaint is that the "is full" test is trivial, and also strange in that max_bytes is used for two unrelated things: - the initial size of the verification arrays, which was always larger than necessary, and now there are three of them - the hint to TidStoreCreate to calculate its max block size / the threshold for being "full" To make the "is_full" test slightly less trivial, my idea is to save the empty store size and later add enough tids so that it has to allocate new blocks/DSA segments, which is not that many, and then it will appear full. I've done this and also separated the purpose of various sizes in v72-0009/10. Using actual work_mem seems a bit more difficult to make this work. > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. Removed in v72-0007 On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I have two questions on tidstore.c: > > +/* > + * Set the given TIDs on the blkno to TidStore. > + * > + * NB: the offset numbers in offsets must be sorted in ascending order. 
> + */ > > Do we need some assertions to check if the given offset numbers are > sorted expectedly? Done in v72-0008 > --- > + if (TidStoreIsShared(ts)) > + found = shared_rt_set(ts->tree.shared, blkno, page); > + else > + found = local_rt_set(ts->tree.local, blkno, page); > + > + Assert(!found); > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > overwrite) the value, I think we should not expect that found is > always false. I find that a puzzling statement, since 1) it was designed for insert-only workloads, not actual overwrite IIRC and 2) the tests will now fail if the same block is set twice, since we just switched the tests to use a remnant of vacuum's old array. Having said that, I don't object to removing artificial barriers to using it for purposes not yet imagined, as long as test_tidstore.sql warns against that. Given the above two things, I think this function's comment needs stronger language about its limitations. Perhaps even mention that it's intended for, and optimized for, vacuum. You and I have long known that tidstore would need a separate, more complex, function to add or remove individual tids from existing entries, but it might be good to have that documented. Other things: v72-0011: Test that zero offset raises an error. v72-0013: I had wanted to microbenchmark this, but since we are running short of time I decided to skip that, so I want to revert some code to make it again more similar to the equivalent in tidbitmap.c. In the absence of evidence, it seems better to do it this way.
Attachment
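The zero-offset test mentioned for v72-0011 presumably exercises a guard of roughly this shape in the test module, where offs/noffs are the parsed offset array and its length (wording and placement are guesses, not the patch text):

	for (int i = 0; i < noffs; i++)
	{
		/* InvalidOffsetNumber (zero) can never appear in a real TID */
		if (!OffsetNumberIsValid(offs[i]))
			elog(ERROR, "offset number %u is invalid", offs[i]);
	}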
On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > > general and put the validation down into C as well. First we save the > > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > > found values in another array. So the static items structure would > > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > > the iteration output is private to check_set_block_offsets(). Then > > > > > sort as needed and check they are all the same. > > > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > > tests too. If you're going to work on this, I'll do other tests on my > > > > environment in the meantime. > > > > > > Some progress on this in v72 -- I tried first without using SQL to > > > save the blocks, just using the unique blocks from the verification > > > array. It seems to work fine. > > > > Thanks! > > Seems I forgot the attachment last time...there's more stuff now > anyway, based on discussion. Thank you for updating the patches! The idea of using three TID arrays for the lookup test and iteration test looks good to me. I think we can add random-TIDs tests on top of it. > > > > - Since there are now three arrays we should reduce max bytes to > > > something smaller. > > > > Agreed. > > I went further than this, see below. > > > > - Further on that, I'm not sure if the "is full" test is telling us > > > much. It seems we could make max bytes a static variable and set it to > > > the size of the empty store. I'm guessing it wouldn't take much to add > > > enough tids so that the contexts need to allocate some blocks, and > > > then it would appear full and we can test that. I've made it so all > > > arrays repalloc when needed, just in case. > > > > How about using work_mem as max_bytes instead of having it as a static > > variable? In test_tidstore.sql we set work_mem before creating the > > tidstore. It would make the tidstore more controllable by SQL queries. > > My complaint is that the "is full" test is trivial, and also strange > in that max_bytes is used for two unrelated things: > > - the initial size of the verification arrays, which was always larger > than necessary, and now there are three of them > - the hint to TidStoreCreate to calculate its max block size / the > threshold for being "full" > > To make the "is_full" test slightly less trivial, my idea is to save > the empty store size and later add enough tids so that it has to > allocate new blocks/DSA segments, which is not that many, and then it > will appear full. I've done this and also separated the purpose of > various sizes in v72-0009/10. I see your point and the changes look good to me. > Using actual work_mem seems a bit more difficult to make this work. Agreed. 
> > > > --- > > + if (TidStoreIsShared(ts)) > > + found = shared_rt_set(ts->tree.shared, blkno, page); > > + else > > + found = local_rt_set(ts->tree.local, blkno, page); > > + > > + Assert(!found); > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > overwrite) the value, I think we should not expect that found is > > always false. > > I find that a puzzling statement, since 1) it was designed for > insert-only workloads, not actual overwrite IIRC and 2) the tests will > now fail if the same block is set twice, since we just switched the > tests to use a remnant of vacuum's old array. Having said that, I > don't object to removing artificial barriers to using it for purposes > not yet imagined, as long as test_tidstore.sql warns against that. I think that if it supports only an insert-only workload and expects the same block to be set only once, it should raise an error rather than an assertion. It's odd to me that the function fails only in assertion-enabled builds even though it actually works fine in that case. As for test_tidstore, you're right that the test code doesn't handle the case of setting the same block twice. I think that there is no problem in the fixed-TIDs tests, but we would need something for random-TIDs tests so that we don't set the same block twice. I guess it could be trivial since we can use SQL queries to generate TIDs. I'm not sure what the random-TIDs tests would look like, but I think we can use SELECT DISTINCT to eliminate the duplicates of block numbers to use. > > Given the above two things, I think this function's comment needs > stronger language about its limitations. Perhaps even mention that > it's intended for, and optimized for, vacuum. You and I have long > known that tidstore would need a separate, more complex, function to > add or remove individual tids from existing entries, but it might be > good to have that documented. Agreed. > > Other things: > > v72-0011: Test that zero offset raises an error. > > v72-0013: I had wanted to microbenchmark this, but since we are > running short of time I decided to skip that, so I want to revert some > code to make it again more similar to the equivalent in tidbitmap.c. > In the absence of evidence, it seems better to do it this way. LGTM. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > overwrite) the value, I think we should not expect that found is > > > always false. > > > > I find that a puzzling statement, since 1) it was designed for > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > now fail if the same block is set twice, since we just switched the > > tests to use a remnant of vacuum's old array. Having said that, I > > don't object to removing artificial barriers to using it for purposes > > not yet imagined, as long as test_tidstore.sql warns against that. > > I think that if it supports only an insert-only workload and expects the > same block to be set only once, it should raise an error rather than an > assertion. It's odd to me that the function fails only in assertion-enabled > builds even though it actually works fine in that case. After thinking some more, I think you're right -- it's too heavy-handed to throw an error/assert and a public function shouldn't make assumptions about the caller. It's probably just a matter of documenting the function (and its lack of generality), and the tests (which are based on the thing we're replacing). > As for test_tidstore, you're right that the test code doesn't handle > the case of setting the same block twice. I think that there is no > problem in the fixed-TIDs tests, but we would need something for > random-TIDs tests so that we don't set the same block twice. I guess > it could be trivial since we can use SQL queries to generate TIDs. I'm > not sure what the random-TIDs tests would look like, but I think we can > use SELECT DISTINCT to eliminate the duplicates of block numbers to > use. Also, I don't think we need random blocks, since the radix tree tests exercise that heavily already. Random offsets is what I was thinking of (if made distinct and ordered), but even there the code is fairly trivial, so I don't have a strong feeling about it. > > Given the above two things, I think this function's comment needs > > stronger language about its limitations. Perhaps even mention that > > it's intended for, and optimized for, vacuum. You and I have long > > known that tidstore would need a separate, more complex, function to > > add or remove individual tids from existing entries, but it might be > > good to have that documented. > > Agreed. How about this: /* - * Set the given TIDs on the blkno to TidStore. + * Create or replace an entry for the given block and array of offsets * - * NB: the offset numbers in offsets must be sorted in ascending order. + * NB: This function is designed and optimized for vacuum's heap scanning + * phase, so has some limitations: + * - The offset numbers in "offsets" must be sorted in ascending order. + * - If the block number already exists, the entry will be replaced -- + * there is no way to add or remove offsets from an entry. */ void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, I think we can stop including the debug-tid-store patch for CI now. That would allow getting rid of some unnecessary variables. More comments: + * Prepare to iterate through a TidStore. Since the radix tree is locked during + * the iteration, TidStoreEndIterate() needs to be called when finished.
+ * Concurrent updates during the iteration will be blocked when inserting a + * key-value to the radix tree. This is outdated. Locking is optional. The remaining real reason now is that TidStoreEndIterate needs to free memory. We probably need to say something about locking, too, but not this. + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs + * in one block. We return the block numbers in ascending order and the offset + * numbers in each result is also sorted in ascending order. + */ +TidStoreIterResult * +TidStoreIterateNext(TidStoreIter *iter) The wording is a bit awkward. +/* + * Finish an iteration over TidStore. This needs to be called after finishing + * or when existing an iteration. + */ s/existing/exiting/ ? It seems to say we need to finish after finishing. Maybe more precise wording. +/* Extract TIDs from the given key-value pair */ +static void +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page) This is a leftover from the old encoding scheme. This should really take a "BlockNumber blockno" not a "key", and the only call site should probably cast the uint64 to BlockNumber. + * tidstore.h + * Tid storage. + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group Update year. +typedef struct BlocktableEntry +{ + uint16 nwords; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; +} BlocktableEntry; In my WIP for runtime-embeddable offsets, nwords needs to be one byte. That doesn't have any real-world effect on the largest offset encountered, and only in 32-bit builds with 32kB block size would the theoretical max change at all. To be precise, we could use in the MaxBlocktableEntrySize calculation: Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); Tests: I never got rid of maxblkno and maxoffset, in case you wanted to do that. And as discussed above, maybe -- Note: The test code uses an array of TIDs for verification similar -- to vacuum's dead item array pre-PG17. To avoid adding duplicates, -- each call to do_set_block_offsets() should use different block -- numbers.
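For readers skimming the struct above: lookup within a BlocktableEntry is a plain word-and-bit calculation, along these lines (WORDNUM/BITNUM as tidbitmap.c defines them; the helper name is illustrative, not from the patch). This is also why an 8kB page needs at most 2048 / 64 = 32 bitmap words.

#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

static inline bool
blocktable_entry_contains(const BlocktableEntry *page, OffsetNumber off)
{
	int			wordnum = WORDNUM(off);

	if (wordnum >= page->nwords)
		return false;

	return (page->words[wordnum] & ((bitmapword) 1 << BITNUM(off))) != 0;
}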
On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > > overwrite) the value, I think we should not expect that found is > > > > always false. > > > > > > I find that a puzzling statement, since 1) it was designed for > > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > > now fail if the same block is set twice, since we just switched the > > > tests to use a remnant of vacuum's old array. Having said that, I > > > don't object to removing artificial barriers to using it for purposes > > > not yet imagined, as long as test_tidstore.sql warns against that. > > > > I think that if it supports only insert-only workload and expects the > > same block is set only once, it should raise an error rather than an > > assertion. It's odd to me that the function fails only with an > > assertion build assertions even though it actually works fine even in > > that case. > > After thinking some more, I think you're right -- it's too > heavy-handed to throw an error/assert and a public function shouldn't > make assumptions about the caller. It's probably just a matter of > documenting the function (and it's lack of generality), and the tests > (which are based on the thing we're replacing). Removed 'found' in 0003 patch. > > > As for test_tidstore you're right that the test code doesn't handle > > the case where setting the same block twice. I think that there is no > > problem in the fixed-TIDs tests, but we would need something for > > random-TIDs tests so that we don't set the same block twice. I guess > > it could be trivial since we can use SQL queries to generate TIDs. I'm > > not sure how the random-TIDs tests would be like, but I think we can > > use SELECT DISTINCT to eliminate the duplicates of block numbers to > > use. > > Also, I don't think we need random blocks, since the radix tree tests > excercise that heavily already. > > Random offsets is what I was thinking of (if made distinct and > ordered), but even there the code is fairy trivial, so I don't have a > strong feeling about it. Agreed. > > > > Given the above two things, I think this function's comment needs > > > stronger language about its limitations. Perhaps even mention that > > > it's intended for, and optimized for, vacuum. You and I have long > > > known that tidstore would need a separate, more complex, function to > > > add or remove individual tids from existing entries, but it might be > > > good to have that documented. > > > > Agreed. > > How about this: > > /* > - * Set the given TIDs on the blkno to TidStore. > + * Create or replace an entry for the given block and array of offsets > * > - * NB: the offset numbers in offsets must be sorted in ascending order. > + * NB: This function is designed and optimized for vacuum's heap scanning > + * phase, so has some limitations: > + * - The offset numbers in "offsets" must be sorted in ascending order. > + * - If the block number already exists, the entry will be replaced -- > + * there is no way to add or remove offsets from an entry. > */ > void > TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, Looks good. 
> > I think we can stop including the debug-tid-store patch for CI now. > > That would allow getting rid of some unnecessary variables. Agreed. > > + * Prepare to iterate through a TidStore. Since the radix tree is locked during > + * the iteration, TidStoreEndIterate() needs to be called when finished. > > + * Concurrent updates during the iteration will be blocked when inserting a > + * key-value to the radix tree. > > This is outdated. Locking is optional. The remaining real reason now > is that TidStoreEndIterate needs to free memory. We probably need to > say something about locking, too, but not this. Fixed. > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > + * in one block. We return the block numbers in ascending order and the offset > + * numbers in each result is also sorted in ascending order. > + */ > +TidStoreIterResult * > +TidStoreIterateNext(TidStoreIter *iter) > > The wording is a bit awkward. Fixed. > > +/* > + * Finish an iteration over TidStore. This needs to be called after finishing > + * or when existing an iteration. > + */ > > s/existing/exiting/ ? > > It seems to say we need to finish after finishing. Maybe more precise wording. Fixed. > > +/* Extract TIDs from the given key-value pair */ > +static void > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > BlocktableEntry *page) > > This is a leftover from the old encoding scheme. This should really > take a "BlockNumber blockno" not a "key", and the only call site > should probably cast the uint64 to BlockNumber. Fixed. > > + * tidstore.h > + * Tid storage. > + * > + * > + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group > > Update year. Updated. > > +typedef struct BlocktableEntry > +{ > + uint16 nwords; > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > +} BlocktableEntry; > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. > That doesn't have any real-world effect on the largest offset > encountered, and only in 32-bit builds with 32kB block size would the > theoretical max change at all. To be precise, we could use in the > MaxBlocktableEntrySize calculation: > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); I don't get this expression. Making the nwords one byte works well? With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256 bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit OS, respectively. A one-byte nwords variable seems not to be sufficient for both cases. Also, where does the expression "BITS_PER_BITMAPWORD * PG_INT8_MAX - 1" come from? > > Tests: I never got rid of maxblkno and maxoffset, in case you wanted > to do that. And as discussed above, maybe > > -- Note: The test code uses an array of TIDs for verification similar > -- to vacuum's dead item array pre-PG17. To avoid adding duplicates, > -- each call to do_set_block_offsets() should use different block > -- numbers. I've added this comment on top of the .sql file. I've attached the new patch sets. The summary of updates is: - Squashed all updates of v72 - 0004 and 0005 are updates for test_tidstore.sql. Particularly the 0005 patch adds randomized TID tests. - 0006 addresses review comments above. - 0007 and 0008 patches are pgindent stuff. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote: > > Random offsets is what I was thinking of (if made distinct and > > ordered), but even there the code is fairy trivial, so I don't have a > > strong feeling about it. > > Agreed. Looks good. A related thing I should mention is that the tests which look up all possible offsets are really expensive with the number of blocks we're using now (assert build): v70 0.33s v72 1.15s v73 1.32 To trim that back, I think we should give up on using shared memory for the is-full test: We can cause aset to malloc a new block with a lot fewer entries. In the attached, this brings it back down to 0.43s. It might also be worth reducing the number of blocks in the random test -- multiple runs will have different offsets anyway. > > I think we can stop including the debug-tid-store patch for CI now. > > That would allow getting rid of some unnecessary variables. > > Agreed. Okay, all that remains here is to get rid of those variables (might be just one). > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > > + * in one block. We return the block numbers in ascending order and the offset > > + * numbers in each result is also sorted in ascending order. > > + */ > > +TidStoreIterResult * > > +TidStoreIterateNext(TidStoreIter *iter) > > > > The wording is a bit awkward. > > Fixed. - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs - * in one block. We return the block numbers in ascending order and the offset - * numbers in each result is also sorted in ascending order. + * Scan the TidStore and return the TIDs of the next block. The returned block + * numbers is sorted in ascending order, and the offset numbers in each result + * is also sorted in ascending order. Better, but it's still not very clear. Maybe "The offsets in each iteration result are ordered, as are the block numbers over all iterations." > > +/* Extract TIDs from the given key-value pair */ > > +static void > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > BlocktableEntry *page) > > > > This is a leftover from the old encoding scheme. This should really > > take a "BlockNumber blockno" not a "key", and the only call site > > should probably cast the uint64 to BlockNumber. > > Fixed. This part looks good. I didn't notice earlier, but this comment has a similar issue @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) return NULL; /* Collect TIDs extracted from the key-value pair */ - tidstore_iter_extract_tids(iter, key, page); + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); ..."extracted" was once a separate operation. I think just removing that one word is enough to update it. Some other review on code comments: v73-0001: + /* Enlarge the TID array if necessary */ It's "arrays" now. v73-0005: +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has +-- different randon 100 offset numbers each other. The numbers are obvious from the query. Maybe just mention that the offsets are randomized and must be unique and ordered. + * The caller is responsible for release any locks. "releasing" > > +typedef struct BlocktableEntry > > +{ > > + uint16 nwords; > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > +} BlocktableEntry; > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. 
I should be more clear here: nwords fitting into one byte allows 3 embedded offsets (1 on 32-bit platforms, which is good for testing at least). With uint16 nwords that reduces to 2 (none on 32-bit platforms). Further, after the current patch series is fully committed, I plan to split the embedded-offset patch into two parts: The first would store the offsets in the header, but would still need a (smaller) allocation. The second would embed them in the child pointer. Only the second patch will care about the size of nwords because it needs to reserve a byte for the pointer tag. > > That doesn't have any real-world affect on the largest offset > > encountered, and only in 32-bit builds with 32kB block size would the > > theoretical max change at all. To be precise, we could use in the > > MaxBlocktableEntrySize calculation: > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > I don't get this expression. Making the nwords one byte works well? > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256 > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit > OS, respectively. One byte nwrods variable seems not to be sufficient I believe there is confusion between bitmap words and bytes: 2048 / 64 = 32 words = 256 bytes It used to be max tuples per (heap) page, but we wanted a simple way to make this independent of heap. I believe we won't need to ever store the actual MaxOffsetNumber, although we technically still could with a one-byte type and 32kB pages, at least on 64-bit platforms. > for both cases. Also, where does the expression "BITS_PER_BITMAPWORD * > PG_INT8_MAX - 1" come from? 127 words, each with 64 (or 32) bits. The zero bit is not a valid offset, so subtract one. And I used signed type in case there was a need for -1 to mean something.
Attachment
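To make the arithmetic in the mail above easier to follow, here is a minimal sketch of the bound being discussed. The macro name is invented for illustration; Min, MaxOffsetNumber, BITS_PER_BITMAPWORD, and PG_INT8_MAX are the existing PostgreSQL symbols referred to in the mail.

/*
 * Sketch only: the largest offset representable if nwords becomes a
 * one-byte signed field, as proposed above.  127 words of 64 (or 32)
 * bits each; bit zero is not a valid offset, hence the "- 1".
 */
#define MAX_OFFSET_WITH_INT8_NWORDS \
	Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)

/*
 * For reference: with 8kB pages, MaxOffsetNumber is 2048, which needs
 * only 2048 / 64 = 32 bitmapwords (256 bytes) on 64-bit platforms, so
 * a one-byte nwords is more than sufficient in practice.
 */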
On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Random offsets is what I was thinking of (if made distinct and > > > ordered), but even there the code is fairy trivial, so I don't have a > > > strong feeling about it. > > > > Agreed. > > Looks good. > > A related thing I should mention is that the tests which look up all > possible offsets are really expensive with the number of blocks we're > using now (assert build): > > v70 0.33s > v72 1.15s > v73 1.32 > > To trim that back, I think we should give up on using shared memory > for the is-full test: We can cause aset to malloc a new block with a > lot fewer entries. In the attached, this brings it back down to 0.43s. Looks good. Agreed with this change. > It might also be worth reducing the number of blocks in the random > test -- multiple runs will have different offsets anyway. Yes. If we reduce the number of blocks from 1000 to 100, the regression test took on my environment: 1000 blocks : 516 ms 100 blocks : 228 ms > > > > I think we can stop including the debug-tid-store patch for CI now. > > > That would allow getting rid of some unnecessary variables. > > > > Agreed. > > Okay, all that remains here is to get rid of those variables (might be > just one). Removed some unnecessary variables in 0002 patch. > > > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > > > + * in one block. We return the block numbers in ascending order and the offset > > > + * numbers in each result is also sorted in ascending order. > > > + */ > > > +TidStoreIterResult * > > > +TidStoreIterateNext(TidStoreIter *iter) > > > > > > The wording is a bit awkward. > > > > Fixed. > > - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > - * in one block. We return the block numbers in ascending order and the offset > - * numbers in each result is also sorted in ascending order. > + * Scan the TidStore and return the TIDs of the next block. The returned block > + * numbers is sorted in ascending order, and the offset numbers in each result > + * is also sorted in ascending order. > > Better, but it's still not very clear. Maybe "The offsets in each > iteration result are ordered, as are the block numbers over all > iterations." Thanks, fixed. > > > > +/* Extract TIDs from the given key-value pair */ > > > +static void > > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > > BlocktableEntry *page) > > > > > > This is a leftover from the old encoding scheme. This should really > > > take a "BlockNumber blockno" not a "key", and the only call site > > > should probably cast the uint64 to BlockNumber. > > > > Fixed. > > This part looks good. I didn't notice earlier, but this comment has a > similar issue > > @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) > return NULL; > > /* Collect TIDs extracted from the key-value pair */ > - tidstore_iter_extract_tids(iter, key, page); > + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); > > ..."extracted" was once a separate operation. I think just removing > that one word is enough to update it. Fixed. > > Some other review on code comments: > > v73-0001: > > + /* Enlarge the TID array if necessary */ > > It's "arrays" now. > > v73-0005: > > +-- Random TIDs test. We insert TIDs for 1000 blocks. 
Each block has > +-- different randon 100 offset numbers each other. > > The numbers are obvious from the query. Maybe just mention that the > offsets are randomized and must be unique and ordered. > > + * The caller is responsible for release any locks. > > "releasing" Fixed. > > > > +typedef struct BlocktableEntry > > > +{ > > > + uint16 nwords; > > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > > +} BlocktableEntry; > > > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. > > I should be more clear here: nwords fitting into one byte allows 3 > embedded offsets (1 on 32-bit platforms, which is good for testing at > least). With uint16 nwords that reduces to 2 (none on 32-bit > platforms). Further, after the current patch series is fully > committed, I plan to split the embedded-offset patch into two parts: > The first would store the offsets in the header, but would still need > a (smaller) allocation. The second would embed them in the child > pointer. Only the second patch will care about the size of nwords > because it needs to reserve a byte for the pointer tag. Thank you for the clarification. > > > > That doesn't have any real-world affect on the largest offset > > > encountered, and only in 32-bit builds with 32kB block size would the > > > theoretical max change at all. To be precise, we could use in the > > > MaxBlocktableEntrySize calculation: > > > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > > > I don't get this expression. Making the nwords one byte works well? > > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256 > > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit > > OS, respectively. One byte nwrods variable seems not to be sufficient > > I believe there is confusion between bitmap words and bytes: > 2048 / 64 = 32 words = 256 bytes Oops, you're right. > > It used to be max tuples per (heap) page, but we wanted a simple way > to make this independent of heap. I believe we won't need to ever > store the actual MaxOffsetNumber, although we technically still could > with a one-byte type and 32kB pages, at least on 64-bit platforms. > > > for both cases. Also, where does the expression "BITS_PER_BITMAPWORD * > > PG_INT8_MAX - 1" come from? > > 127 words, each with 64 (or 32) bits. The zero bit is not a valid > offset, so subtract one. And I used signed type in case there was a > need for -1 to mean something. Okay, I missed that we want to change nwords from uint8 to int8. So the MaxBlocktableEntrySize calculation would be as follows? #define MaxBlocktableEntrySize \ offsetof(BlocktableEntry, words) + \ (sizeof(bitmapword) * \ WORDS_PER_PAGE(Min(MaxOffsetNumber, \ BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))) I've made this change in the 0003 patch. While reviewing the vacuum patch, I realized that we always pass LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related to the tidstore is therefore always the same. I think it would be better to make the caller of TidStoreCreate() specify the tranch_id and pass it to RT_CREATE(). That way, the caller can specify their own wait event for tidstore. The 0008 patch tried this idea. dshash.c does the same idea. Other patches are minor updates for tidstore and vacuum patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
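As a rough illustration of the tranche idea described above (not the committed form), the creation signature quoted later in the thread suggests a shape like the following; the caller-side registration and the tranche name are assumptions added here for illustration.

/* Proposed shape: the caller supplies the LWLock tranche id, which is
 * forwarded to RT_CREATE() for the shared radix tree's lock. */
TidStore *TidStoreCreate(size_t max_bytes, dsa_area *area, int tranche_id);

/* Hypothetical caller registering its own tranche (name illustrative): */
int			tranche_id = LWLockNewTrancheId();

LWLockRegisterTranche(tranche_id, "parallel_vacuum_tidstore");
dead_items = TidStoreCreate(max_bytes, dead_items_area, tranche_id);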
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote: > > It might also be worth reducing the number of blocks in the random > > test -- multiple runs will have different offsets anyway. > > Yes. If we reduce the number of blocks from 1000 to 100, the > regression test took on my environment: > > 1000 blocks : 516 ms > 100 blocks : 228 ms Sounds good. > Removed some unnecessary variables in 0002 patch. Looks good. > So the MaxBlocktableEntrySize calculation would be as follows? > > #define MaxBlocktableEntrySize \ > offsetof(BlocktableEntry, words) + \ > (sizeof(bitmapword) * \ > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ > BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))) > > I've made this change in the 0003 patch. This is okay, but one side effect is that we have both an assert and an elog, for different limits. I think we'll need a separate #define to help. But for now, I don't want to hold up tidstore further with this because I believe almost everything else in v74 is in pretty good shape. I'll save this for later as a part of the optimization I proposed. Remaining things I noticed: +#define RT_PREFIX local_rt +#define RT_PREFIX shared_rt Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts" + /* MemoryContext where the radix tree uses */ s/where/that/ +/* + * Lock support functions. + * + * We can use the radix tree's lock for shared TidStore as the data we + * need to protect is only the shared radix tree. + */ +void +TidStoreLockExclusive(TidStore *ts) Talking about multiple things, so maybe a blank line after the comment. With those, I think you can go ahead and squash all the tidstore patches except for 0003 and commit it. > While reviewing the vacuum patch, I realized that we always pass > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > to the tidstore is therefore always the same. I think it would be > better to make the caller of TidStoreCreate() specify the tranch_id > and pass it to RT_CREATE(). That way, the caller can specify their own > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > the same idea. Sounds reasonable. I'll just note that src/include/storage/lwlock.h still has an entry for LWTRANCHE_SHARED_TIDSTORE.
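For context on the prefix suggestion above, tidstore.c instantiates the radix tree template twice, roughly as below. Only the prefixes follow the suggestion; the exact set of RT_* switches is abbreviated and partly assumed here.

/* local (backend-private) tree */
#define RT_PREFIX local_ts
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#include "lib/radixtree.h"

/* shared (DSA-backed) tree */
#define RT_PREFIX shared_ts
#define RT_SHMEM
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#include "lib/radixtree.h"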
On Tue, Mar 19, 2024 at 6:40 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > It might also be worth reducing the number of blocks in the random > > > test -- multiple runs will have different offsets anyway. > > > > Yes. If we reduce the number of blocks from 1000 to 100, the > > regression test took on my environment: > > > > 1000 blocks : 516 ms > > 100 blocks : 228 ms > > Sounds good. > > > Removed some unnecessary variables in 0002 patch. > > Looks good. > > > So the MaxBlocktableEntrySize calculation would be as follows? > > > > #define MaxBlocktableEntrySize \ > > offsetof(BlocktableEntry, words) + \ > > (sizeof(bitmapword) * \ > > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ > > BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))) > > > > I've made this change in the 0003 patch. > > This is okay, but one side effect is that we have both an assert and > an elog, for different limits. I think we'll need a separate #define > to help. But for now, I don't want to hold up tidstore further with > this because I believe almost everything else in v74 is in pretty good > shape. I'll save this for later as a part of the optimization I > proposed. > > Remaining things I noticed: > > +#define RT_PREFIX local_rt > +#define RT_PREFIX shared_rt > > Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts" > > + /* MemoryContext where the radix tree uses */ > > s/where/that/ > > +/* > + * Lock support functions. > + * > + * We can use the radix tree's lock for shared TidStore as the data we > + * need to protect is only the shared radix tree. > + */ > +void > +TidStoreLockExclusive(TidStore *ts) > > Talking about multiple things, so maybe a blank line after the comment. > > With those, I think you can go ahead and squash all the tidstore > patches except for 0003 and commit it. > > > While reviewing the vacuum patch, I realized that we always pass > > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > > to the tidstore is therefore always the same. I think it would be > > better to make the caller of TidStoreCreate() specify the tranch_id > > and pass it to RT_CREATE(). That way, the caller can specify their own > > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > > the same idea. > > Sounds reasonable. I'll just note that src/include/storage/lwlock.h > still has an entry for LWTRANCHE_SHARED_TIDSTORE. Thank you. I've incorporated all the comments above. I've attached the latest patches, and am going to push them (one by one) after self-review again. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Locally (not CI), we should try big inputs to make sure we can > > actually go up to many GB -- it's easier and faster this way than > > having vacuum give us a large data set. > > I'll do these tests. I just remembered this -- did any of this kind of testing happen? I can do it as well. > Thank you. I've incorporated all the comments above. I've attached the > latest patches, and am going to push them (one by one) after > self-review again. One more cosmetic thing in 0001 that caught my eye: diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile index b9aff0ccfd..67b8cc6108 100644 --- a/src/backend/access/common/Makefile +++ b/src/backend/access/common/Makefile @@ -27,6 +27,7 @@ OBJS = \ syncscan.o \ toast_compression.o \ toast_internals.o \ + tidstore.o \ tupconvert.o \ tupdesc.o diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build index 725041a4ce..a02397855e 100644 --- a/src/backend/access/common/meson.build +++ b/src/backend/access/common/meson.build @@ -15,6 +15,7 @@ backend_sources += files( 'syncscan.c', 'toast_compression.c', 'toast_internals.c', + 'tidstore.c', 'tupconvert.c', 'tupdesc.c', ) These aren't in alphabetical order.
On Wed, Mar 20, 2024 at 3:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > Locally (not CI), we should try big inputs to make sure we can > > > actually go up to many GB -- it's easier and faster this way than > > > having vacuum give us a large data set. > > > > I'll do these tests. > > I just remembered this -- did any of this kind of testing happen? I > can do it as well. I forgot to report the results. Yes, I did some tests where I inserted many TIDs to make the tidstore use several GB memory. I did two cases: 1. insert 100M blocks of TIDs with an offset of 100. 2. insert 10M blocks of TIDs with an offset of 2048. The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup and iteration results were expected. > > > Thank you. I've incorporated all the comments above. I've attached the > > latest patches, and am going to push them (one by one) after > > self-review again. > > One more cosmetic thing in 0001 that caught my eye: > > diff --git a/src/backend/access/common/Makefile > b/src/backend/access/common/Makefile > index b9aff0ccfd..67b8cc6108 100644 > --- a/src/backend/access/common/Makefile > +++ b/src/backend/access/common/Makefile > @@ -27,6 +27,7 @@ OBJS = \ > syncscan.o \ > toast_compression.o \ > toast_internals.o \ > + tidstore.o \ > tupconvert.o \ > tupdesc.o > > diff --git a/src/backend/access/common/meson.build > b/src/backend/access/common/meson.build > index 725041a4ce..a02397855e 100644 > --- a/src/backend/access/common/meson.build > +++ b/src/backend/access/common/meson.build > @@ -15,6 +15,7 @@ backend_sources += files( > 'syncscan.c', > 'toast_compression.c', > 'toast_internals.c', > + 'tidstore.c', > 'tupconvert.c', > 'tupdesc.c', > ) > > These aren't in alphabetical order. Good catch. I'll fix them before the push. While reviewing the codes again, the following two things caught my eyes: in check_set_block_offset() function, we don't take a lock on the tidstore while checking all possible TIDs. I'll add TidStoreLockShare() and TidStoreUnlock() as follows: + TidStoreLockShare(tidstore); if (TidStoreIsMember(tidstore, &tid)) ItemPointerSet(&items.lookup_tids[num_lookup_tids++], blkno, offset); + TidStoreUnlock(tidstore); --- Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take a lock on the shared tidstore since dsa_get_total_size() (called by RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it in the comment as follows: -/* Return the memory usage of TidStore */ +/* + * Return the memory usage of TidStore. + * + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate + * locking, the caller doesn't need to take a lock. + */ What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I forgot to report the results. Yes, I did some tests where I inserted > many TIDs to make the tidstore use several GB memory. I did two cases: > > 1. insert 100M blocks of TIDs with an offset of 100. > 2. insert 10M blocks of TIDs with an offset of 2048. > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > and iteration results were expected. Thanks for confirming! > While reviewing the codes again, the following two things caught my eyes: > > in check_set_block_offset() function, we don't take a lock on the > tidstore while checking all possible TIDs. I'll add > TidStoreLockShare() and TidStoreUnlock() as follows: > > + TidStoreLockShare(tidstore); > if (TidStoreIsMember(tidstore, &tid)) > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > blkno, offset); > + TidStoreUnlock(tidstore); In one sense, all locking in the test module is useless since there is only a single process. On the other hand, it seems good to at least run what we have written to run it trivially, and serve as an example of usage. We should probably be consistent, and document at the top that the locks are pro-forma only. It's both a blessing and a curse that vacuum only has a single writer. It makes development less of a hassle, but also means that tidstore locking is done for API-completeness reasons, not (yet) as a practical necessity. Even tidbitmap.c's hash table currently has a single writer, and while using tidstore for that is still an engineering challenge for other reasons, it wouldn't exercise locking meaningfully, either, at least at first. > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > a lock on the shared tidstore since dsa_get_total_size() (called by > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > in the comment as follows: > > -/* Return the memory usage of TidStore */ > +/* > + * Return the memory usage of TidStore. > + * > + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate > + * locking, the caller doesn't need to take a lock. > + */ > > What do you think? That duplicates the underlying comment on the radix tree function that this calls, so I'm inclined to leave it out. At this level it's probably best to document when a caller _does_ need to take an action. One thing I forgot to ask about earlier: +-- Add tids in out of order. Are they (the blocks to be precise) really out of order? The VALUES statement is ordered, but after inserting it does not output that way. I wondered if this is platform independent, but CI and our dev machines haven't failed this test, and I haven't looked into what determines the order. It's easy enough to hide the blocks if we ever need to, as we do elsewhere...
On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I forgot to report the results. Yes, I did some tests where I inserted > > many TIDs to make the tidstore use several GB memory. I did two cases: > > > > 1. insert 100M blocks of TIDs with an offset of 100. > > 2. insert 10M blocks of TIDs with an offset of 2048. > > > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > > and iteration results were expected. > > Thanks for confirming! > > > While reviewing the codes again, the following two things caught my eyes: > > > > in check_set_block_offset() function, we don't take a lock on the > > tidstore while checking all possible TIDs. I'll add > > TidStoreLockShare() and TidStoreUnlock() as follows: > > > > + TidStoreLockShare(tidstore); > > if (TidStoreIsMember(tidstore, &tid)) > > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > > blkno, offset); > > + TidStoreUnlock(tidstore); > > In one sense, all locking in the test module is useless since there is > only a single process. On the other hand, it seems good to at least > run what we have written to run it trivially, and serve as an example > of usage. We should probably be consistent, and document at the top > that the locks are pro-forma only. Agreed. > > > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > > a lock on the shared tidstore since dsa_get_total_size() (called by > > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > > in the comment as follows: > > > > -/* Return the memory usage of TidStore */ > > +/* > > + * Return the memory usage of TidStore. > > + * > > + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate > > + * locking, the caller doesn't need to take a lock. > > + */ > > > > What do you think? > > That duplicates the underlying comment on the radix tree function that > this calls, so I'm inclined to leave it out. At this level it's > probably best to document when a caller _does_ need to take an action. Okay, I didn't change it. > > One thing I forgot to ask about earlier: > > +-- Add tids in out of order. > > Are they (the blocks to be precise) really out of order? The VALUES > statement is ordered, but after inserting it does not output that way. > I wondered if this is platform independent, but CI and our dev > machines haven't failed this test, and I haven't looked into what > determines the order. It's easy enough to hide the blocks if we ever > need to, as we do elsewhere... It seems not necessary as such a test is already covered by test_radixtree. I've changed the query to hide the output blocks. I've pushed the tidstore patch after incorporating the above changes. In addition to that, I've added the following changes before the push: - Added src/test/modules/test_tidstore/.gitignore file. - Removed unnecessary #include from tidstore.c. The buildfarm has been all-green so far. I've attached the latest vacuum improvement patch. I just remembered that the tidstore cannot still be used for parallel vacuum with minimum maintenance_work_mem. Even when the shared tidstore is empty, its memory usage reports 1056768 bytes, a bit above 1MB (1048576 bytes). We need something discussed on another thread[1] in order to make it work. Regards, [1] https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Are they (the blocks to be precise) really out of order? The VALUES > > statement is ordered, but after inserting it does not output that way. > > I wondered if this is platform independent, but CI and our dev > > machines haven't failed this test, and I haven't looked into what > > determines the order. It's easy enough to hide the blocks if we ever > > need to, as we do elsewhere... > > It seems not necessary as such a test is already covered by > test_radixtree. I've changed the query to hide the output blocks. Okay. > The buildfarm has been all-green so far. Great! > I've attached the latest vacuum improvement patch. > > I just remembered that the tidstore cannot still be used for parallel > vacuum with minimum maintenance_work_mem. Even when the shared > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > 1MB (1048576 bytes). We need something discussed on another thread[1] > in order to make it work. For exactly this reason, we used to have a clamp on max_bytes when it was internal to tidstore, so that it never reported full when first created, so I guess that got thrown away when we got rid of the control object in shared memory. Forcing callers to clamp their own limits seems pretty unfriendly, though. The proposals in that thread are pretty simple. If those don't move forward soon, a hackish workaround would be to round down the number we get from dsa_get_total_size to the nearest megabyte. Then controlling min/max segment size would be a nice-to-have for PG17, not a prerequisite.
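A minimal sketch of that rounding workaround, assuming a helper with an invented name; dsa_get_total_size() is the existing function the measurement already goes through.

/* Round the reported shared-memory usage down to the nearest megabyte
 * so an empty store doesn't appear full with 1MB m_w_m (sketch only). */
static size_t
shared_ts_memory_usage_rounded(dsa_area *area)
{
	size_t		total = dsa_get_total_size(area);

	return total - (total % (1024 * 1024));
}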
On Thu, Mar 21, 2024 at 12:40 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > Are they (the blocks to be precise) really out of order? The VALUES > > > statement is ordered, but after inserting it does not output that way. > > > I wondered if this is platform independent, but CI and our dev > > > machines haven't failed this test, and I haven't looked into what > > > determines the order. It's easy enough to hide the blocks if we ever > > > need to, as we do elsewhere... > > > > It seems not necessary as such a test is already covered by > > test_radixtree. I've changed the query to hide the output blocks. > > Okay. > > > The buildfarm has been all-green so far. > > Great! > > > I've attached the latest vacuum improvement patch. > > > > I just remembered that the tidstore cannot still be used for parallel > > vacuum with minimum maintenance_work_mem. Even when the shared > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > in order to make it work. > > For exactly this reason, we used to have a clamp on max_bytes when it > was internal to tidstore, so that it never reported full when first > created, so I guess that got thrown away when we got rid of the > control object in shared memory. Forcing callers to clamp their own > limits seems pretty unfriendly, though. Or we can have a new function for dsa.c to set the initial and max segment size (or either one) to the existing DSA area so that TidStoreCreate() can specify them at creation. In shared TidStore cases, since all memory required by shared radix tree is allocated in the passed-in DSA area and the memory usage is the total segment size allocated in the DSA area, the user will have to prepare a DSA area only for the shared tidstore. So we might be able to expect that the DSA passed-in to TidStoreCreate() is empty and its segment sizes can be adjustable. > > The proposals in that thread are pretty simple. If those don't move > forward soon, a hackish workaround would be to round down the number > we get from dsa_get_total_size to the nearest megabyte. Then > controlling min/max segment size would be a nice-to-have for PG17, not > a prerequisite. Interesting idea. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 21, 2024 at 3:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 12:40 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > Are they (the blocks to be precise) really out of order? The VALUES > > > > statement is ordered, but after inserting it does not output that way. > > > > I wondered if this is platform independent, but CI and our dev > > > > machines haven't failed this test, and I haven't looked into what > > > > determines the order. It's easy enough to hide the blocks if we ever > > > > need to, as we do elsewhere... > > > > > > It seems not necessary as such a test is already covered by > > > test_radixtree. I've changed the query to hide the output blocks. > > > > Okay. > > > > > The buildfarm has been all-green so far. > > > > Great! > > > > > I've attached the latest vacuum improvement patch. > > > > > > I just remembered that the tidstore cannot still be used for parallel > > > vacuum with minimum maintenance_work_mem. Even when the shared > > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > > in order to make it work. > > > > For exactly this reason, we used to have a clamp on max_bytes when it > > was internal to tidstore, so that it never reported full when first > > created, so I guess that got thrown away when we got rid of the > > control object in shared memory. Forcing callers to clamp their own > > limits seems pretty unfriendly, though. > > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area, the user will have to prepare a DSA area > only for the shared tidstore. So we might be able to expect that the > DSA passed-in to TidStoreCreate() is empty and its segment sizes can > be adjustable. Yet another idea is that TidStore creates its own DSA area in TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new API to get the DSA area. The caller (e.g. parallel vacuum) gets the dsa_handle of the DSA and stores it in the shared memory (e.g. in PVShared). TidStoreAttach() will take two arguments: dsa_handle for the DSA area and dsa_pointer for the shared radix tree. This idea still requires controlling min/max segment sizes since dsa_create() uses the 1MB as the initial segment size. But the TidStoreCreate() would be more user friendly. I've attached a PoC patch for discussion. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
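The new API pieces this PoC implies would presumably look something like the following; the accessor names are guesses based on the description above, not the actual patch.

/* TidStoreCreate() calls dsa_create() itself for the shared case, so
 * callers no longer pass in a dsa_area.  The caller (e.g. parallel
 * vacuum) then publishes two handles, for instance in PVShared: */
dsa_handle	TidStoreGetDSAHandle(TidStore *ts); /* handle of the private DSA area */
dsa_pointer TidStoreGetHandle(TidStore *ts);	/* location of the shared radix tree */

/* Workers attach using both handles. */
TidStore   *TidStoreAttach(dsa_handle area_handle, dsa_pointer handle);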
On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. I didn't like this very much, because it's splitting an operation across an API boundary. The caller already has all the information it needs when it creates the DSA. Straw man proposal: it could do the same for local memory, then they'd be more similar. But if we made local contexts the responsibility of the caller, that would cause duplication between creating and resetting. > In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area ...plus apparently some overhead, I just found out today, but that's beside the point. On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Yet another idea is that TidStore creates its own DSA area in > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > dsa_handle of the DSA and stores it in the shared memory (e.g. in > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > the DSA area and dsa_pointer for the shared radix tree. This idea > still requires controlling min/max segment sizes since dsa_create() > uses the 1MB as the initial segment size. But the TidStoreCreate() > would be more user friendly. This seems like an overall simplification, aside from future size configuration, so +1 to continue looking into this. If we go this route, I'd like to avoid a boolean parameter and cleanly separate TidStoreCreateLocal() and TidStoreCreateShared(). Every operation after that can introspect, but it's a bit awkward to force these cases into the same function. It always was a little bit, but this change makes it more so.
On Thu, Mar 21, 2024 at 4:35 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Or we can have a new function for dsa.c to set the initial and max > > segment size (or either one) to the existing DSA area so that > > TidStoreCreate() can specify them at creation. > > I didn't like this very much, because it's splitting an operation > across an API boundary. The caller already has all the information it > needs when it creates the DSA. Straw man proposal: it could do the > same for local memory, then they'd be more similar. But if we made > local contexts the responsibility of the caller, that would cause > duplication between creating and resetting. Fair point. > > > In shared TidStore > > cases, since all memory required by shared radix tree is allocated in > > the passed-in DSA area and the memory usage is the total segment size > > allocated in the DSA area > > ...plus apparently some overhead, I just found out today, but that's > beside the point. > > On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Yet another idea is that TidStore creates its own DSA area in > > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > > dsa_handle of the DSA and stores it in the shared memory (e.g. in > > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > > the DSA area and dsa_pointer for the shared radix tree. This idea > > still requires controlling min/max segment sizes since dsa_create() > > uses the 1MB as the initial segment size. But the TidStoreCreate() > > would be more user friendly. > > This seems like an overall simplification, aside from future size > configuration, so +1 to continue looking into this. If we go this > route, I'd like to avoid a boolean parameter and cleanly separate > TidStoreCreateLocal() and TidStoreCreateShared(). Every operation > after that can introspect, but it's a bit awkward to force these cases > into the same function. It always was a little bit, but this change > makes it more so. I've looked into this idea further. Overall, it looks clean and I don't see any problem so far in terms of integration with lazy vacuum. I've attached three patches for discussion and tests. - 0001 patch makes lazy vacuum use of tidstore. - 0002 patch makes DSA init/max segment size configurable (borrowed from another thread). - 0003 patch makes TidStore create its own DSA area with init/max DSA segment adjustment (PoC patch). One thing unclear to me is that this idea will be usable even when we want to use the tidstore for parallel bitmap scan. Currently, we create a shared tidbitmap on a DSA area in ParallelExecutorInfo. This DSA area is used not only for tidbitmap but also for parallel hash etc. If the tidstore created its own DSA area, parallel bitmap scan would have to use the tidstore's DSA in addition to the DSA area in ParallelExecutorInfo. I'm not sure if there are some differences between these usages in terms of resource manager etc. It seems no problem but I might be missing something. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've looked into this idea further. Overall, it looks clean and I > don't see any problem so far in terms of integration with lazy vacuum. > I've attached three patches for discussion and tests. Seems okay in the big picture, it's the details we need to be careful of. v77-0001 - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); - dead_items->max_items = max_items; - dead_items->num_items = 0; + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); + + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); + dead_items_info->max_bytes = vac_work_mem * 1024L; This is confusing enough that it looks like a bug: [inside TidStoreCreate()] /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ while (16 * maxBlockSize > max_bytes * 1024L) maxBlockSize >>= 1; This was copied from CreateWorkExprContext, which operates directly on work_mem -- if the parameter is actually bytes, we can't "* 1024" here. If we're passing something measured in kilobytes, the parameter is badly named. Let's use convert once and use bytes everywhere. Note: This was not another pass over the whole vacuum patch, just looking an the issue at hand. Also for later: Dilip Kumar reviewed an earlier version. v77-0002: +#define dsa_create(tranch_id) \ + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) Since these macros are now referring to defaults, maybe their name should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE (*_MAX_*) +/* The minimum size of a DSM segment. */ +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB m_w_m to work correctly. v77-0003: +/* Public APIs to create local or shared TidStore */ + +TidStore * +TidStoreCreateLocal(size_t max_bytes) +{ + return tidstore_create_internal(max_bytes, false, 0); +} + +TidStore * +TidStoreCreateShared(size_t max_bytes, int tranche_id) +{ + return tidstore_create_internal(max_bytes, true, tranche_id); +} I don't think these operations have enough in common to justify sharing even an internal implementation. Choosing aset block size is done for both memory types, but it's pointless to do it for shared memory, because the local context is then only used for small metadata. + /* + * Choose the DSA initial and max segment sizes to be no longer than + * 1/16 and 1/8 of max_bytes, respectively. + */ I'm guessing the 1/8 here because the number of segments is limited? I know these numbers are somewhat arbitrary, but readers will wonder why one has 1/8 and the other has 1/16. + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) + dsa_init_size = DSA_MIN_SEGMENT_SIZE; + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) + dsa_max_size = DSA_MAX_SEGMENT_SIZE; The second clamp seems against the whole point of this patch -- it seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? Did you try it with 1MB m_w_m?
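To illustrate the point about not sharing an internal implementation, the local creation path could stand alone roughly as follows. The context name, clamp constants, and the template function signature are placeholders here, not the patch's code.

TidStore *
TidStoreCreateLocal(size_t max_bytes)
{
	TidStore   *ts = palloc0(sizeof(TidStore));
	Size		maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

	/* block-size tuning matters only for the local tree's allocations */
	while (16 * maxBlockSize > max_bytes)
		maxBlockSize >>= 1;
	if (maxBlockSize < ALLOCSET_DEFAULT_INITSIZE)
		maxBlockSize = ALLOCSET_DEFAULT_INITSIZE;

	ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
										   "TID storage",
										   ALLOCSET_DEFAULT_MINSIZE,
										   ALLOCSET_DEFAULT_INITSIZE,
										   maxBlockSize);
	ts->tree.local = local_ts_create(ts->rt_context);

	return ts;
}

The shared path would then only need its small metadata context plus the DSA segment-size choices discussed above, with no aset block-size tuning at all.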
On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've looked into this idea further. Overall, it looks clean and I > > don't see any problem so far in terms of integration with lazy vacuum. > > I've attached three patches for discussion and tests. > > Seems okay in the big picture, it's the details we need to be careful of. > > v77-0001 > > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); > - dead_items->max_items = max_items; > - dead_items->num_items = 0; > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > + > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > This is confusing enough that it looks like a bug: > > [inside TidStoreCreate()] > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > while (16 * maxBlockSize > max_bytes * 1024L) > maxBlockSize >>= 1; > > This was copied from CreateWorkExprContext, which operates directly on > work_mem -- if the parameter is actually bytes, we can't "* 1024" > here. If we're passing something measured in kilobytes, the parameter > is badly named. Let's use convert once and use bytes everywhere. True. The attached 0001 patch fixes it. > > v77-0002: > > +#define dsa_create(tranch_id) \ > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > Since these macros are now referring to defaults, maybe their name > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > (*_MAX_*) It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current name also makes sense to me. > > +/* The minimum size of a DSM segment. */ > +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) > > That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB > m_w_m to work correctly. Fixed. > > v77-0003: > > +/* Public APIs to create local or shared TidStore */ > + > +TidStore * > +TidStoreCreateLocal(size_t max_bytes) > +{ > + return tidstore_create_internal(max_bytes, false, 0); > +} > + > +TidStore * > +TidStoreCreateShared(size_t max_bytes, int tranche_id) > +{ > + return tidstore_create_internal(max_bytes, true, tranche_id); > +} > > I don't think these operations have enough in common to justify > sharing even an internal implementation. Choosing aset block size is > done for both memory types, but it's pointless to do it for shared > memory, because the local context is then only used for small > metadata. > > + /* > + * Choose the DSA initial and max segment sizes to be no longer than > + * 1/16 and 1/8 of max_bytes, respectively. > + */ > > I'm guessing the 1/8 here because the number of segments is limited? I > know these numbers are somewhat arbitrary, but readers will wonder why > one has 1/8 and the other has 1/16. > > + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) > + dsa_init_size = DSA_MIN_SEGMENT_SIZE; > + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) > + dsa_max_size = DSA_MAX_SEGMENT_SIZE; > > The second clamp seems against the whole point of this patch -- it > seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? > Did you try it with 1MB m_w_m? I've incorporated the above comments and test results look good to me. I've attached the several patches: - 0002 is a minor fix for tidstore I found. - 0005 changes the create APIs of tidstore. 
- 0006 update the vacuum improvement patch to use the new TidStoreCreateLocal/Shared() APIs. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v78-0005-Rethink-create-and-attach-APIs-of-shared-TidStor.patch
- v78-0004-Allow-specifying-initial-and-maximum-segment-siz.patch
- v78-0003-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
- v78-0006-Adjust-the-vacuum-improvement-patch-to-new-TidSt.patch
- v78-0002-Fix-an-inconsistent-function-prototype-with-the-.patch
- v78-0001-Fix-a-calculation-in-TidStoreCreate.patch
John Naylor <johncnaylorls@gmail.com> writes: > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! I'm not sure why it took a couple weeks for Coverity to notice ee1b30f12, but it saw it today, and it's not happy: /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down() 1615 node = child; 1616 shift -= RT_SPAN; 1617 } 1618 1619 /* Reserve slot for the value. */ 1620 n4 = (RT_NODE_4 *) node.local; >>> CID 1594658: Integer handling issues (BAD_SHIFT) >>> In expression "key >> shift", shifting by a negative amount has undefined behavior. The shift amount, "shift", is as little as -7. 1621 n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); 1622 n4->base.count = 1; 1623 1624 return &n4->children[0]; 1625 } 1626 I think the point here is that if you start with an arbitrary non-negative shift value, the preceding loop may in fact decrement it down to something less than zero before exiting, in which case we would indeed have trouble. I suspect that the code is making undocumented assumptions about the possible initial values of shift. Maybe some Asserts would be good? Also, if we're effectively assuming that shift must be exactly zero here, why not let the compiler hard-code that? - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); regards, tom lane
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > John Naylor <johncnaylorls@gmail.com> writes: > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! Thank you for the report. > > I'm not sure why it took a couple weeks for Coverity to notice > ee1b30f12, but it saw it today, and it's not happy: Hmm, I've also done Coverity Scan in development but I wasn't able to see this one for some reason... > > /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down() > 1615 node = child; > 1616 shift -= RT_SPAN; > 1617 } > 1618 > 1619 /* Reserve slot for the value. */ > 1620 n4 = (RT_NODE_4 *) node.local; > >>> CID 1594658: Integer handling issues (BAD_SHIFT) > >>> In expression "key >> shift", shifting by a negative amount has undefined behavior. The shift amount, "shift",is as little as -7. > 1621 n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > 1622 n4->base.count = 1; > 1623 > 1624 return &n4->children[0]; > 1625 } > 1626 > > I think the point here is that if you start with an arbitrary > non-negative shift value, the preceding loop may in fact decrement it > down to something less than zero before exiting, in which case we > would indeed have trouble. I suspect that the code is making > undocumented assumptions about the possible initial values of shift. > Maybe some Asserts would be good? Also, if we're effectively assuming > that shift must be exactly zero here, why not let the compiler > hard-code that? > > - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); Sounds like a good solution. I've attached the patch for that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
Masahiko Sawada <sawada.mshk@gmail.com> writes: > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think the point here is that if you start with an arbitrary >> non-negative shift value, the preceding loop may in fact decrement it >> down to something less than zero before exiting, in which case we >> would indeed have trouble. I suspect that the code is making >> undocumented assumptions about the possible initial values of shift. >> Maybe some Asserts would be good? Also, if we're effectively assuming >> that shift must be exactly zero here, why not let the compiler >> hard-code that? > Sounds like a good solution. I've attached the patch for that. Personally I'd put the Assert immediately after the loop, because it's not related to the "Reserve slot for the value" comment. Seems reasonable otherwise. regards, tom lane
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > I'm not sure why it took a couple weeks for Coverity to notice > > ee1b30f12, but it saw it today, and it's not happy: > > Hmm, I've also done Coverity Scan in development but I wasn't able to > see this one for some reason... Hmm, before 30e144287 this code only ran in a test module, is it possible Coverity would not find it there?
John Naylor <johncnaylorls@gmail.com> writes: > Hmm, before 30e144287 this code only ran in a test module, is it > possible Coverity would not find it there? That could indeed explain why Coverity didn't see it. I'm not sure how our community run is set up, but it may not build the test modules. regards, tom lane
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think the point here is that if you start with an arbitrary > >> non-negative shift value, the preceding loop may in fact decrement it > >> down to something less than zero before exiting, in which case we > >> would indeed have trouble. I suspect that the code is making > >> undocumented assumptions about the possible initial values of shift. > >> Maybe some Asserts would be good? Also, if we're effectively assuming > >> that shift must be exactly zero here, why not let the compiler > >> hard-code that? > > > Sounds like a good solution. I've attached the patch for that. > > Personally I'd put the Assert immediately after the loop, because > it's not related to the "Reserve slot for the value" comment. > Seems reasonable otherwise. > Thanks. Pushed the fix after moving the Assert. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
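Putting the two pieces together, the pushed fix presumably ends up along these lines; this is a sketch reconstructed from the Coverity excerpt and the review comments, not a quote of the commit.

	/* ... descent loop elided; it decrements shift by RT_SPAN per level ... */

	/* the loop must land exactly on the leaf level */
	Assert(shift == 0);

	/* Reserve slot for the value. */
	n4 = (RT_NODE_4 *) node.local;
	n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);
	n4->base.count = 1;

	return &n4->children[0];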
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > v77-0001 > > > > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); > > - dead_items->max_items = max_items; > > - dead_items->num_items = 0; > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > + > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > This is confusing enough that it looks like a bug: > > > > [inside TidStoreCreate()] > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > while (16 * maxBlockSize > max_bytes * 1024L) > > maxBlockSize >>= 1; > > > > This was copied from CreateWorkExprContext, which operates directly on > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > here. If we're passing something measured in kilobytes, the parameter > > is badly named. Let's use convert once and use bytes everywhere. > > True. The attached 0001 patch fixes it. v78-0001 and 02 are fine, but for 0003 there is a consequence that I didn't see mentioned: vac_work_mem now refers to bytes, where before it referred to kilobytes. It seems pretty confusing to use a different convention from elsewhere, especially if it has the same name but different meaning across versions. Worse, this change is buried inside a moving-stuff-around diff, making it hard to see. Maybe "convert only once" is still possible, but I was actually thinking of + dead_items_info->max_bytes = vac_work_mem * 1024L; + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); That way it's pretty obvious that it's correct. That may require a bit of duplication and moving around for shmem, but there is some of that already. More on 0003: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TidStore, a storage for dead TIDs + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum I wonder if the comments here should refer to it using a more natural spelling, like "TID store". - * items in the dead_items array for later vacuuming, count live and + * items in the dead_items for later vacuuming, count live and Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? - * remaining LP_DEAD line pointers on the page in the dead_items - * array. These dead items include those pruned by lazy_scan_prune() - * as well we line pointers previously marked LP_DEAD. + * remaining LP_DEAD line pointers on the page in the dead_items. + * These dead items include those pruned by lazy_scan_prune() as well + * we line pointers previously marked LP_DEAD. Here maybe "into dead_items". Also, "we line pointers" seems to be a pre-existing typo. - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); This is a translated message, so let's keep the message the same. /* * Allocate dead_items (either using palloc, or in dynamic shared memory). * Sets dead_items in vacrel for caller. * * Also handles parallel initialization as part of allocating dead_items in * DSM when required. 
*/ static void dead_items_alloc(LVRelState *vacrel, int nworkers) This comment didn't change at all. It's not wrong, but let's consider updating the specifics. v78-0004: > > +#define dsa_create(tranch_id) \ > > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > > > Since these macros are now referring to defaults, maybe their name > > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > > (*_MAX_*) > > It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that > the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current > name also makes sense to me. Right, that makes sense. v78-0005: "Although commit XXX allowed specifying the initial and maximum DSA segment sizes, callers still needed to clamp their own limits, which was not consistent and user-friendly." Perhaps s/still needed/would have needed/ ..., since we're preventing that necessity. > > Did you try it with 1MB m_w_m? > > I've incorporated the above comments and test results look good to me. Could you be more specific about what the test was? Does it work with 1MB m_w_m? + /* + * Choose the initial and maximum DSA segment sizes to be no longer + * than 1/16 and 1/8 of max_bytes, respectively. If the initial + * segment size is low, we end up having many segments, which risks + * exceeding the total number of segments the platform can have. The second sentence is technically correct, but I'm not sure how it relates to the code that follows. + while (16 * dsa_init_size > max_bytes) + dsa_init_size >>= 1; + while (8 * dsa_max_size > max_bytes) + dsa_max_size >>= 1; I'm not sure we need a separate loop for "dsa_init_size". Can we just have : while (8 * dsa_max_size > max_bytes) dsa_max_size >>= 1; if (dsa_max_size < DSA_MIN_SEGMENT_SIZE) dsa_max_size = DSA_MIN_SEGMENT_SIZE; if (dsa_init_size > dsa_max_size) dsa_init_size = dsa_max_size; @@ -113,13 +113,10 @@ static void tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, * CurrentMemoryContext at the time of this call. The TID storage, backed * by a radix tree, will live in its child memory context, rt_context. The * TidStore will be limited to (approximately) max_bytes total memory - * consumption. If the 'area' is non-NULL, the radix tree is created in the - * DSA area. - * - * The returned object is allocated in backend-local memory. + * consumption. The existing comment slipped past my radar, but max_bytes is not a limit, it's a hint. Come to think of it, it never was a limit in the normal sense, but in earlier patches it was the criteria for reporting "I'm full" when asked. void TidStoreDestroy(TidStore *ts) { - /* Destroy underlying radix tree */ if (TidStoreIsShared(ts)) + { + /* Destroy underlying radix tree */ shared_ts_free(ts->tree.shared); + + dsa_detach(ts->area); + } else local_ts_free(ts->tree.local); It's still destroyed in the local case, so not sure why this comment was moved? v78-0006: -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2 +/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */ I don't see any use in core outside this module -- maybe it's possible to renumber these?
On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > v77-0001 > > > > > > - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); > > > - dead_items->max_items = max_items; > > > - dead_items->num_items = 0; > > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > > + > > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > > > This is confusing enough that it looks like a bug: > > > > > > [inside TidStoreCreate()] > > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > > while (16 * maxBlockSize > max_bytes * 1024L) > > > maxBlockSize >>= 1; > > > > > > This was copied from CreateWorkExprContext, which operates directly on > > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > > here. If we're passing something measured in kilobytes, the parameter > > > is badly named. Let's use convert once and use bytes everywhere. > > > > True. The attached 0001 patch fixes it. > > v78-0001 and 02 are fine, but for 0003 there is a consequence that I > didn't see mentioned: I think that the fix done in 0001 patch can be merged into 0003 patch. > vac_work_mem now refers to bytes, where before > it referred to kilobytes. It seems pretty confusing to use a different > convention from elsewhere, especially if it has the same name but > different meaning across versions. Worse, this change is buried inside > a moving-stuff-around diff, making it hard to see. Maybe "convert only > once" is still possible, but I was actually thinking of > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); > > That way it's pretty obvious that it's correct. That may require a bit > of duplication and moving around for shmem, but there is some of that > already. Agreed. > > More on 0003: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TidStore, a storage for dead TIDs > > + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the > + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum > > I wonder if the comments here should refer to it using a more natural > spelling, like "TID store". > > - * items in the dead_items array for later vacuuming, count live and > + * items in the dead_items for later vacuuming, count live and > > Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? > > - * remaining LP_DEAD line pointers on the page in the dead_items > - * array. These dead items include those pruned by lazy_scan_prune() > - * as well we line pointers previously marked LP_DEAD. > + * remaining LP_DEAD line pointers on the page in the dead_items. > + * These dead items include those pruned by lazy_scan_prune() as well > + * we line pointers previously marked LP_DEAD. > > Here maybe "into dead_items". > > Also, "we line pointers" seems to be a pre-existing typo. 
> > - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", > - vacrel->relname, (long long) index, vacuumed_pages))); > + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers > in %u pages", > + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); > > This is a translated message, so let's keep the message the same. > > /* > * Allocate dead_items (either using palloc, or in dynamic shared memory). > * Sets dead_items in vacrel for caller. > * > * Also handles parallel initialization as part of allocating dead_items in > * DSM when required. > */ > static void > dead_items_alloc(LVRelState *vacrel, int nworkers) > > This comment didn't change at all. It's not wrong, but let's consider > updating the specifics. Fixed above comments. > v78-0005: > > "Although commit XXX > allowed specifying the initial and maximum DSA segment sizes, callers > still needed to clamp their own limits, which was not consistent and > user-friendly." > > Perhaps s/still needed/would have needed/ ..., since we're preventing > that necessity. > > > > Did you try it with 1MB m_w_m? > > > > I've incorporated the above comments and test results look good to me. > > Could you be more specific about what the test was? > Does it work with 1MB m_w_m? If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. FYI other test cases I tested were: * m_w_m = 2199023254528 (maximum value) initial: 1MB max: 128GB * m_w_m = 64MB (default) initial: 1MB max: 8MB > > + /* > + * Choose the initial and maximum DSA segment sizes to be no longer > + * than 1/16 and 1/8 of max_bytes, respectively. If the initial > + * segment size is low, we end up having many segments, which risks > + * exceeding the total number of segments the platform can have. > > The second sentence is technically correct, but I'm not sure how it > relates to the code that follows. > > + while (16 * dsa_init_size > max_bytes) > + dsa_init_size >>= 1; > + while (8 * dsa_max_size > max_bytes) > + dsa_max_size >>= 1; > > I'm not sure we need a separate loop for "dsa_init_size". Can we just have : > > while (8 * dsa_max_size > max_bytes) > dsa_max_size >>= 1; > > if (dsa_max_size < DSA_MIN_SEGMENT_SIZE) > dsa_max_size = DSA_MIN_SEGMENT_SIZE; > > if (dsa_init_size > dsa_max_size) > dsa_init_size = dsa_max_size; Agreed. > > @@ -113,13 +113,10 @@ static void > tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, > * CurrentMemoryContext at the time of this call. The TID storage, backed > * by a radix tree, will live in its child memory context, rt_context. The > * TidStore will be limited to (approximately) max_bytes total memory > - * consumption. If the 'area' is non-NULL, the radix tree is created in the > - * DSA area. > - * > - * The returned object is allocated in backend-local memory. > + * consumption. > > The existing comment slipped past my radar, but max_bytes is not a > limit, it's a hint. Come to think of it, it never was a limit in the > normal sense, but in earlier patches it was the criteria for reporting > "I'm full" when asked. Updated the comment. > > void > TidStoreDestroy(TidStore *ts) > { > - /* Destroy underlying radix tree */ > if (TidStoreIsShared(ts)) > + { > + /* Destroy underlying radix tree */ > shared_ts_free(ts->tree.shared); > + > + dsa_detach(ts->area); > + } > else > local_ts_free(ts->tree.local); > > It's still destroyed in the local case, so not sure why this comment was moved? 
> > v78-0006: > > -#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2 > +/* 2 was PARALLEL_VACUUM_KEY_DEAD_ITEMS */ > > I don't see any use in core outside this module -- maybe it's possible > to renumber these? Fixed the above points. I've attached the latest patches. The 0004 and 0006 patches are updates from the previous version. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
- v79-0006-Address-review-comments-on-vacuum-integration.patch
- v79-0004-Address-review-comments-on-tidstore.patch
- v79-0005-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
- v79-0002-Allow-specifying-initial-and-maximum-segment-siz.patch
- v79-0003-Rethink-create-and-attach-APIs-of-shared-TidStor.patch
- v79-0001-Fix-an-inconsistent-function-prototype-with-the-.patch
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > - * remaining LP_DEAD line pointers on the page in the dead_items > > - * array. These dead items include those pruned by lazy_scan_prune() > > - * as well we line pointers previously marked LP_DEAD. > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > + * These dead items include those pruned by lazy_scan_prune() as well > > + * we line pointers previously marked LP_DEAD. > > > > Here maybe "into dead_items". - * remaining LP_DEAD line pointers on the page in the dead_items. + * remaining LP_DEAD line pointers on the page into the dead_items. Let me explain. It used to be "in the dead_items array." It is not an array anymore, so it was changed to "in the dead_items". dead_items is a variable name, and names don't take "the". "into dead_items" seems most natural to me, but there are other possible phrasings. > > > > Did you try it with 1MB m_w_m? > > > > > > I've incorporated the above comments and test results look good to me. > > > > Could you be more specific about what the test was? > > Does it work with 1MB m_w_m? > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > FYI other test cases I tested were: > > * m_w_m = 2199023254528 (maximum value) > initial: 1MB > max: 128GB > > * m_w_m = 64MB (default) > initial: 1MB > max: 8MB If the test was a vacuum, how big a table was needed to hit 128GB? > > The existing comment slipped past my radar, but max_bytes is not a > > limit, it's a hint. Come to think of it, it never was a limit in the > > normal sense, but in earlier patches it was the criteria for reporting > > "I'm full" when asked. > > Updated the comment. + * max_bytes is not a limit; it's used to choose the memory block sizes of + * a memory context for TID storage in order for the total memory consumption + * not to be overshot a lot. The caller can use the max_bytes as the criteria + * for reporting whether it's full or not. This is good information. I suggest this edit: "max_bytes" is not an internally-enforced limit; it is used only as a hint to cap the memory block size of the memory context for TID storage. This reduces space wastage due to over-allocation. If the caller wants to monitor memory usage, it must compare its limit with the value reported by TidStoreMemoryUsage(). Other comments: v79-0002 looks good to me. v79-0003: "With this commit, when creating a shared TidStore, a dedicated DSA area is created for TID storage instead of using the provided DSA area." This is very subtle, but "the provided..." implies there still is one. -> "a provided..." + * Similar to TidStoreCreateLocal() but create a shared TidStore on a + * DSA area. The TID storage will live in the DSA area, and a memory + * context rt_context will have only meta data of the radix tree. -> "the memory context" I think you can go ahead and commit 0002 and 0003/4. v79-0005: - bypass = (vacrel->lpdead_item_pages < threshold && - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); + bypass = (vacrel->lpdead_item_pages < threshold) && + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); The parentheses look strange, and the first line shouldn't change without a good reason. 
- /* Set dead_items space */ - dead_items = (VacDeadItems *) shm_toc_lookup(toc, - PARALLEL_VACUUM_KEY_DEAD_ITEMS, - false); + /* Set dead items */ + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, + shared->dead_items_handle); I feel ambivalent about this comment change. The original is not very descriptive to begin with. If we need to change at all, maybe "find dead_items in shared memory"? v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. v79-0006: vac_work_mem should also go back to being an int.
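To make the "hint, not a limit" wording above concrete, the caller-side pattern looks roughly like the following. This is a sketch assembled from names that appear in this thread (TidStoreMemoryUsage(), vacrel->dead_items, dead_items_info->max_bytes, lazy_vacuum()), not the exact committed code.

	/*
	 * max_bytes is only a hint to the TID store itself; the caller enforces
	 * its budget by comparing the usage reported by TidStoreMemoryUsage()
	 * against the limit it recorded at create time.
	 */
	if (TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes)
		lazy_vacuum(vacrel);	/* vacuum indexes and heap, then start a new round */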
On Wed, Mar 27, 2024 at 9:25 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > - * as well we line pointers previously marked LP_DEAD. > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > + * we line pointers previously marked LP_DEAD. > > > > > > Here maybe "into dead_items". > > - * remaining LP_DEAD line pointers on the page in the dead_items. > + * remaining LP_DEAD line pointers on the page into the dead_items. > > Let me explain. It used to be "in the dead_items array." It is not an > array anymore, so it was changed to "in the dead_items". dead_items is > a variable name, and names don't take "the". "into dead_items" seems > most natural to me, but there are other possible phrasings. Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > Could you be more specific about what the test was? > > > Does it work with 1MB m_w_m? > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > FYI other test cases I tested were: > > > > * m_w_m = 2199023254528 (maximum value) > > initial: 1MB > > max: 128GB > > > > * m_w_m = 64MB (default) > > initial: 1MB > > max: 8MB > > If the test was a vacuum, how big a table was needed to hit 128GB? I just checked how TIdStoreCreateLocal() calculated the initial and max segment sizes while changing m_w_m, so didn't check how big segments are actually allocated in the maximum value test case. > > > > The existing comment slipped past my radar, but max_bytes is not a > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > normal sense, but in earlier patches it was the criteria for reporting > > > "I'm full" when asked. > > > > Updated the comment. > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > + * a memory context for TID storage in order for the total memory consumption > + * not to be overshot a lot. The caller can use the max_bytes as the criteria > + * for reporting whether it's full or not. > > This is good information. I suggest this edit: > > "max_bytes" is not an internally-enforced limit; it is used only as a > hint to cap the memory block size of the memory context for TID > storage. This reduces space wastage due to over-allocation. If the > caller wants to monitor memory usage, it must compare its limit with > the value reported by TidStoreMemoryUsage(). > > Other comments: Thanks for the suggestion! > > v79-0002 looks good to me. > > v79-0003: > > "With this commit, when creating a shared TidStore, a dedicated DSA > area is created for TID storage instead of using the provided DSA > area." > > This is very subtle, but "the provided..." implies there still is one. > -> "a provided..." > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > + * DSA area. 
The TID storage will live in the DSA area, and a memory > + * context rt_context will have only meta data of the radix tree. > > -> "the memory context" Fixed in the latest patch. > > I think you can go ahead and commit 0002 and 0003/4. I've pushed the 0002 (dsa init and max segment size) patch, and will push the attached 0001 patch next. > > v79-0005: > > - bypass = (vacrel->lpdead_item_pages < threshold && > - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); > + bypass = (vacrel->lpdead_item_pages < threshold) && > + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); > > The parentheses look strange, and the first line shouldn't change > without a good reason. Fixed. > > - /* Set dead_items space */ > - dead_items = (VacDeadItems *) shm_toc_lookup(toc, > - PARALLEL_VACUUM_KEY_DEAD_ITEMS, > - false); > + /* Set dead items */ > + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, > + shared->dead_items_handle); > > I feel ambivalent about this comment change. The original is not very > descriptive to begin with. If we need to change at all, maybe "find > dead_items in shared memory"? Agreed. > > v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. > > v79-0006: > > vac_work_mem should also go back to being an int. Fixed. I've attached the latest patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 27, 2024 at 9:25 AM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > > - * as well we line pointers previously marked LP_DEAD. > > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > > + * we line pointers previously marked LP_DEAD. > > > > > > > > Here maybe "into dead_items". > > > > - * remaining LP_DEAD line pointers on the page in the dead_items. > > + * remaining LP_DEAD line pointers on the page into the dead_items. > > > > Let me explain. It used to be "in the dead_items array." It is not an > > array anymore, so it was changed to "in the dead_items". dead_items is > > a variable name, and names don't take "the". "into dead_items" seems > > most natural to me, but there are other possible phrasings. > > Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > > > Could you be more specific about what the test was? > > > > Does it work with 1MB m_w_m? > > > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > > > FYI other test cases I tested were: > > > > > > * m_w_m = 2199023254528 (maximum value) > > > initial: 1MB > > > max: 128GB > > > > > > * m_w_m = 64MB (default) > > > initial: 1MB > > > max: 8MB > > > > If the test was a vacuum, how big a table was needed to hit 128GB? > > I just checked how TIdStoreCreateLocal() calculated the initial and > max segment sizes while changing m_w_m, so didn't check how big > segments are actually allocated in the maximum value test case. > > > > > > > The existing comment slipped past my radar, but max_bytes is not a > > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > > normal sense, but in earlier patches it was the criteria for reporting > > > > "I'm full" when asked. > > > > > > Updated the comment. > > > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > > + * a memory context for TID storage in order for the total memory consumption > > + * not to be overshot a lot. The caller can use the max_bytes as the criteria > > + * for reporting whether it's full or not. > > > > This is good information. I suggest this edit: > > > > "max_bytes" is not an internally-enforced limit; it is used only as a > > hint to cap the memory block size of the memory context for TID > > storage. This reduces space wastage due to over-allocation. If the > > caller wants to monitor memory usage, it must compare its limit with > > the value reported by TidStoreMemoryUsage(). > > > > Other comments: > > Thanks for the suggestion! > > > > > v79-0002 looks good to me. > > > > v79-0003: > > > > "With this commit, when creating a shared TidStore, a dedicated DSA > > area is created for TID storage instead of using the provided DSA > > area." > > > > This is very subtle, but "the provided..." 
implies there still is one. > > -> "a provided..." > > > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > > + * DSA area. The TID storage will live in the DSA area, and a memory > > + * context rt_context will have only meta data of the radix tree. > > > > -> "the memory context" > > Fixed in the latest patch. > > > > > I think you can go ahead and commit 0002 and 0003/4. > > I've pushed the 0002 (dsa init and max segment size) patch, and will > push the attached 0001 patch next. Pushed the refactoring patch. I've attached the rebased vacuum improvement patch for cfbot. I mentioned in the commit message that this patch eliminates the 1GB limitation. I think the patch is in good shape. Do you have other comments or suggestions, John? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Pushed the refactoring patch. > > I've attached the rebased vacuum improvement patch for cfbot. I > mentioned in the commit message that this patch eliminates the 1GB > limitation. > > I think the patch is in good shape. Do you have other comments or > suggestions, John? I'll do another pass tomorrow, but first I wanted to get in another slightly-challenging in-situ test. On my humble laptop, I can still fit a table large enough to cause PG16 to choke on multiple rounds of index cleanup: drop table if exists test; create unlogged table test (a int, b uuid) with (autovacuum_enabled=false); insert into test (a,b) select i, gen_random_uuid() from generate_series(1,1000*1000*1000) i; create index on test (a); create index on test (b); delete from test; vacuum (verbose, truncate off, parallel 2) test; INFO: vacuuming "john.public.test" INFO: launched 1 parallel vacuum worker for index vacuuming (planned: 1) INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total) tuples: 999997174 removed, 2826 remain, 0 are dead but not yet removable tuples missed: 2826 dead from 18 pages not removed due to cleanup lock contention removable cutoff: 771, which was 0 XIDs old when operation ended new relfrozenxid: 767, which is 4 XIDs ahead of previous value frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 6369409 pages from table (100.00% of total) had 999997174 dead item identifiers removed index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted, 2741825 currently deleted, 0 reusable index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted, 3842056 currently deleted, 0 reusable avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied WAL usage: 3 records, 1 full page images, 2050 bytes system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 731.59 s Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880. About the "tuples missed" -- I didn't expect contention during this test. I believe that's completely unrelated behavior, but wanted to mention it anyway, since I found it confusing.
On Thu, Mar 28, 2024 at 6:15 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Pushed the refactoring patch. > > > > I've attached the rebased vacuum improvement patch for cfbot. I > > mentioned in the commit message that this patch eliminates the 1GB > > limitation. > > > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > I'll do another pass tomorrow, but first I wanted to get in another > slightly-challenging in-situ test. Thanks! > > About the "tuples missed" -- I didn't expect contention during this > test. I believe that's completely unrelated behavior, but wanted to > mention it anyway, since I found it confusing. I haven't investigated it enough yet, but bgwriter might be related to the contention. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think the patch is in good shape. Do you have other comments or > suggestions, John? --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1918,11 +1918,6 @@ include_dir 'conf.d' too high. It may be useful to control for this by separately setting <xref linkend="guc-autovacuum-work-mem"/>. </para> - <para> - Note that for the collection of dead tuple identifiers, - <command>VACUUM</command> is only able to utilize up to a maximum of - <literal>1GB</literal> of memory. - </para> </listitem> </varlistentry> This is mentioned twice for two different GUCs -- need to remove the other one, too. Other than that, I just have minor nits: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TID store, a storage for dead TIDs I think I've helped edit this sentence before, but I still don't quite like it. I'm thinking now "is storage for the dead tuple IDs". - * set upper bounds on the number of TIDs we can keep track of at once. + * set upper bounds on the maximum memory that can be used for keeping track + * of dead TIDs at once. I think "maximum" is redundant with "upper bounds". I also feel the commit message needs more "meat" -- we need to clearly narrate the features and benefits. I've attached how I would write it, but feel free to use what you like to match your taste. I've marked it Ready for Committer.
Attachment
On Fri, Mar 29, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -1918,11 +1918,6 @@ include_dir 'conf.d' > too high. It may be useful to control for this by separately > setting <xref linkend="guc-autovacuum-work-mem"/>. > </para> > - <para> > - Note that for the collection of dead tuple identifiers, > - <command>VACUUM</command> is only able to utilize up to a maximum of > - <literal>1GB</literal> of memory. > - </para> > </listitem> > </varlistentry> > > This is mentioned twice for two different GUCs -- need to remove the > other one, too. Good catch, removed. > Other than that, I just have minor nits: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TID store, a storage for dead TIDs > > I think I've helped edit this sentence before, but I still don't quite > like it. I'm thinking now "is storage for the dead tuple IDs". > > - * set upper bounds on the number of TIDs we can keep track of at once. > + * set upper bounds on the maximum memory that can be used for keeping track > + * of dead TIDs at once. > > I think "maximum" is redundant with "upper bounds". Fixed. > > I also feel the commit message needs more "meat" -- we need to clearly > narrate the features and benefits. I've attached how I would write it, > but feel free to use what you like to match your taste. Well, that's much better than mine. > > I've marked it Ready for Committer. Thank you! I've attached the patch that I'm going to push tomorrow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you! I've attached the patch that I'm going to push tomorrow. Excellent! I've attached a mostly-polished update on runtime embeddable values, storing up to 3 offsets in the child pointer (1 on 32-bit platforms). As discussed, this includes a macro to cap max possible offset that can be stored in the bitmap, which I believe only reduces the valid offset range for 32kB pages on 32-bit platforms. Even there, it allows for more line pointers than can possibly be useful. It also splits into two parts for readability. It would be committed in two pieces as well, since they are independently useful.
Attachment
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote: > I've attached a mostly-polished update on runtime embeddable values, > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). And...since there's a new bump context patch, I wanted to anticipate squeezing an update on top of that, if that gets committed. 0004/5 are the v6 bump context, and 0006 uses it for vacuum. The rest are to show it works -- the expected.out changes make possible problems in CI easier to see. The allocation size is 16 bytes, so this difference is entirely due to lack of chunk header: aset: 6619136 bump: 5047296 (Note: assert builds still have the chunk header for sanity checking, so this was done in a more optimized build)
Attachment
- v84-0005-Introduce-a-bump-memory-allocator.patch
- v84-0008-DEV-compare-bump-context-in-tests.patch
- v84-0004-Enlarge-bit-space-for-MemoryContextMethodID.patch
- v84-0007-DEV-log-memory-usage-in-tests.patch
- v84-0006-Use-bump-context-for-vacuum-s-TID-storage.patch
- v84-0003-Teach-radix-tree-to-embed-values-at-runtime.patch
- v84-0002-pgindent.patch
- v84-0001-store-offsets-in-the-header.patch
Hi, On 2024-04-01 11:53:28 +0900, Masahiko Sawada wrote: > On Fri, Mar 29, 2024 at 4:21 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I've marked it Ready for Committer. > > Thank you! I've attached the patch that I'm going to push tomorrow. Locally I ran a 32bit build with ubsan enabled (by accident actually), which complains: performing post-bootstrap initialization ... ----------------------------------- stderr ----------------------------------- ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24: runtime error: member access withinmisaligned address 0xffb6258e for type 'struct BlocktableEntry', which requires 4 byte alignment 0xffb6258e: note: pointer points here 00 00 02 00 01 40 dc e9 83 0b 80 48 70 ee 00 00 00 00 00 00 00 01 17 00 00 00 f8 d4 a6 ee e8 25 ^ #0 0x814097e in TidStoreSetBlockOffsets ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #1 0x826560a in dead_items_add ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #2 0x825f8da in lazy_scan_prune ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #3 0x825da71 in lazy_scan_heap ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #4 0x825ad8f in heap_vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #5 0x8697e97 in table_relation_vacuum ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #6 0x869fca6 in vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206 #7 0x869a0fd in vacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:622 #8 0x869986b in ExecVacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:449 #9 0x8e5f832 in standard_ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:859 #10 0x8e5e5f6 in ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:523 #11 0x8e5b71a in PortalRunUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1158 #12 0x8e5be80 in PortalRunMulti ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1315 #13 0x8e59f9b in PortalRun ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:791 #14 0x8e4d5f3 in exec_simple_query ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1274 #15 0x8e55159 in PostgresMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4680 #16 0x8e54445 in PostgresSingleUserMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4136 #17 0x88bb55e in main ../../../../../home/andres/src/postgresql/src/backend/main/main.c:194 #18 0xf76f47c4 (/lib/i386-linux-gnu/libc.so.6+0x237c4) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #19 0xf76f4887 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x23887) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #20 0x80d40f7 in _start (/srv/dev/build/postgres/m-dev-assert-32/tmp_install/srv/dev/install/postgres/m-dev-assert-32/bin/postgres+0x80d40f7) SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24in Aborted (core dumped) child process exited with exit code 134 initdb: data directory "/srv/dev/build/postgres/m-dev-assert-32/tmp_install/initdb-template" not removed at user's request At first I was confused why CI didn't find this. 
Turns out that, for me, this is only triggered without compiler optimizations, and I had used -O0 while CI uses some optimizations. Backtrace: #9 0x0814097f in TidStoreSetBlockOffsets (ts=0xb8dfde4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #10 0x0826560b in dead_items_add (vacrel=0xb8df6d4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #11 0x0825f8db in lazy_scan_prune (vacrel=0xb8df6d4, buf=24, blkno=15, page=0xeeb6c000 "", vmbuffer=729, all_visible_according_to_vm=false, has_lpdead_items=0xffb62a1f) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #12 0x0825da72 in lazy_scan_heap (vacrel=0xb8df6d4) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #13 0x0825ad90 in heap_vacuum_rel (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #14 0x08697e98 in table_relation_vacuum (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #15 0x0869fca7 in vacuum_rel (relid=1249, relation=0x0, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206 #16 0x0869a0fe in vacuum (relations=0xb99de08, params=0xffb62e90, bstrategy=0xb99d5e0, vac_context=0xb99d550, isTopLevel=true) (gdb) p/x page $1 = 0xffb6258e I think compiler optimizations are only tangentially involved here, they trigger the stack frame layout to change, e.g. because some variable will just exist in a register. Looking at the code, the failure isn't suprising anymore: char data[MaxBlocktableEntrySize]; BlocktableEntry *page = (BlocktableEntry *) data; 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in a char[]. You can't just do that. Look at how we do that for e.g. PGAlignedblock. With the attached minimal fix, the tests pass again. Greetings, Andres Freund
Attachment
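For anyone unfamiliar with the trick being referenced: the usual fix is to give the local buffer a type whose alignment matches the struct stored in it, typically via a union, the same pattern PGAlignedBlock uses. A sketch of the idea (the layout and names are illustrative, not necessarily what the attached patch does):

	/*
	 * A bare char[] has no particular alignment, so casting it to
	 * BlocktableEntry * and dereferencing it is undefined behavior on
	 * alignment-sensitive targets. Declaring the buffer as a union with the
	 * struct forces suitable alignment while still exposing raw bytes.
	 */
	union
	{
		char		data[MaxBlocktableEntrySize];
		BlocktableEntry force_align;	/* never accessed; alignment only */
	}			buf;
	BlocktableEntry *page = (BlocktableEntry *) buf.data;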
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund <andres@anarazel.de> wrote: > > Looking at the code, the failure isn't suprising anymore: > char data[MaxBlocktableEntrySize]; > BlocktableEntry *page = (BlocktableEntry *) data; > > 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in > a char[]. You can't just do that. Look at how we do that for > e.g. PGAlignedblock. > > > With the attached minimal fix, the tests pass again. Thanks, will push this shortly!
Hi, John!
On Mon, 8 Apr 2024 at 03:13, John Naylor <johncnaylorls@gmail.com> wrote:
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Looking at the code, the failure isn't suprising anymore:
> char data[MaxBlocktableEntrySize];
> BlocktableEntry *page = (BlocktableEntry *) data;
>
> 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> a char[]. You can't just do that. Look at how we do that for
> e.g. PGAlignedblock.
>
>
> With the attached minimal fix, the tests pass again.
Thanks, will push this shortly!
Buildfarm animal mylodon looks unhappy with this:
FAILED: src/backend/postgres_lib.a.p/access_common_tidstore.c.o ccache clang-14 -Isrc/backend/postgres_lib.a.p -Isrc/include -I../pgsql/src/include -I/usr/include/libxml2 -I/usr/include/security -fdiagnostics-color=never -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -O2 -g -fno-strict-aliasing -fwrapv -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wdeclaration-after-statement -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Wno-array-bounds -std=c99 -Wc11-extensions -Werror=c11-extensions -fPIC -isystem /usr/include/mit-krb5 -pthread -DBUILDING_DLL -MD -MQ src/backend/postgres_lib.a.p/access_common_tidstore.c.o -MF src/backend/postgres_lib.a.p/access_common_tidstore.c.o.d -o src/backend/postgres_lib.a.p/access_common_tidstore.c.o -c ../pgsql/src/backend/access/common/tidstore.c ../pgsql/src/backend/access/common/tidstore.c:48:3: error: anonymous structs are a C11 extension [-Werror,-Wc11-extensions] struct ^
1 error generated.
Regards,
Pavel Borisov
Supabase
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote: > > I've attached a mostly-polished update on runtime embeddable values, > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). > As discussed, this includes a macro to cap max possible offset that > can be stored in the bitmap, which I believe only reduces the valid > offset range for 32kB pages on 32-bit platforms. Even there, it allows > for more line pointers than can possibly be useful. It also splits > into two parts for readability. It would be committed in two pieces as > well, since they are independently useful. I pushed both of these and see that mylodon complains that anonymous unions are a C11 feature. I'm not actually sure that the union with uintptr_t is actually needed, though, since that's not accessed as such here. The simplest thing seems to get rid of the union and name the inner struct "header", as in the attached.
Attachment
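Spelling out the shape being described: giving the inner struct member a name avoids the anonymous-member C11 extension mylodon complains about, without changing the layout. The sketch below mirrors the fields visible in this thread and is not necessarily byte-for-byte what was committed; in particular, the real code may order the two one-byte members differently on big-endian builds.

	typedef struct BlocktableEntry
	{
		struct
		{
			int8		flags;	/* reserves a byte for the pointer tag */
			int8		nwords; /* length of words[], in bitmapwords */
			OffsetNumber full_offsets[NUM_FULL_OFFSETS];	/* embedded offsets */
		}			header;		/* member is named, so no anonymous struct */

		bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
	} BlocktableEntry;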
On Mon, 8 Apr 2024 at 16:27, John Naylor <johncnaylorls@gmail.com> wrote:
On Sun, Apr 7, 2024 at 9:08 AM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.
I pushed both of these and see that mylodon complains that anonymous
unions are a C11 feature. I'm not actually sure that the union with
uintptr_t is actually needed, though, since that's not accessed as
such here. The simplest thing seems to get rid of the union and name
the inner struct "header", as in the attached.
Provided uintptr_t is not accessed, it might be good to get rid of it.
Maybe this patch also needs a correction here:
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))
Regards,
Pavel
On Mon, Apr 8, 2024 at 7:42 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >> I pushed both of these and see that mylodon complains that anonymous >> unions are a C11 feature. I'm not actually sure that the union with >> uintptr_t is actually needed, though, since that's not accessed as >> such here. The simplest thing seems to get rid if the union and name >> the inner struct "header", as in the attached. > > > Provided uintptr_t is not accessed it might be good to get rid of it. > > Maybe this patch also need correction in this: > +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber)) For full context the diff was -#define NUM_FULL_OFFSETS ((sizeof(bitmapword) - sizeof(uint16)) / sizeof(OffsetNumber)) +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber)) I wanted the former, from f35bd9bf35 , to be independently useful (in case the commit in question had some unresolvable issue), and its intent is to fill struct padding when the array of bitmapword happens to have length zero. Changing to uintptr_t for the size calculation reflects the intent to fit in a (local) pointer, regardless of the size of a bitmapword. (If a DSA pointer happens to be a different size for some odd platform, it should still work, BTW.) My thinking with the union was, for big-endian, to force the 'flags' member to where it can be set, but thinking again, it should still work if by happenstance the header was smaller than the child pointer: A different bit would get tagged, but I believe that's irrelevant. The 'flags' member makes sure a byte is reserved for the tag, but it may not be where the tag is actually located, if that makes sense.
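The arithmetic in the new definition is easy to sanity-check in isolation. The standalone sketch below assumes only that OffsetNumber is a 2-byte integer, as in PostgreSQL:

	#include <stdint.h>
	#include <stdio.h>

	typedef uint16_t OffsetNumber;	/* 2 bytes, matching PostgreSQL's typedef */

	int
	main(void)
	{
		/*
		 * One local pointer's worth of space, minus the two one-byte header
		 * fields, divided into 2-byte offsets.
		 */
		size_t		nfull = (sizeof(uintptr_t) - sizeof(uint8_t) - sizeof(int8_t)) /
			sizeof(OffsetNumber);

		/* prints 3 on 64-bit builds and 1 on 32-bit builds */
		printf("offsets embeddable in a child pointer: %zu\n", nfull);
		return 0;
	}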
On Mon, Apr 8, 2024 at 7:26 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I pushed both of these and see that mylodon complains that anonymous > unions are a C11 feature. I'm not actually sure that the union with > uintptr_t is actually needed, though, since that's not accessed as > such here. The simplest thing seems to get rid if the union and name > the inner struct "header", as in the attached. I pushed this with some comment adjustments.
I took a look at the coverage report from [1] and it seems pretty good, but there are a couple more tests we could do. - RT_KEY_GET_SHIFT is not covered for key=0: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 That should be fairly simple to add to the tests. - Some paths for single-value leaves are not covered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 However, these paths do get regression test coverage on 32-bit machines. 64-bit builds only have leaves in the TID store, which doesn't (currently) delete entries, and doesn't instantiate the tree with the debug option. - In RT_SET "if (found)" is not covered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 That's because we don't yet have code that replaces an existing value with a value of a different length. - RT_FREE_RECURSE isn't well covered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 The TID store test is pretty simple as far as distribution of block keys, and focuses more on the offset bitmaps. We could try to cover all branches here, but it would make the test less readable, and it's kind of the wrong place to do that anyway. test_radixtree.c does have a commented-out option to use shared memory, but that's for local testing and won't be reflected in the coverage report. Maybe it's enough. - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 That should be easy to add. - RT_DUMP_NODE is not covered, and never called by default anyway: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804 It seems we could just leave it alone since it's debug-only, but it's also a lot of lines. One idea is to use elog with DEBUG5 instead of commenting out the call sites, but that would cause a lot of noise. - TidStoreCreate* has some memory clamps that are not covered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 Maybe we could experiment with using 1MB for shared, and something smaller for local. [1] https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote: > - Some paths for single-value leaves are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 > > However, these paths do get regression test coverage on 32-bit > machines. 64-bit builds only have leaves in the TID store, which > doesn't (currently) delete entries, and doesn't instantiate the tree > with the debug option. > > - In RT_SET "if (found)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > That's because we don't yet have code that replaces an existing value > with a value of a different length. I saw a SIGSEGV there when using tidstore to write a fix for something else. Patch attached.
Attachment
On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote: > > I took a look at the coverage report from [1] and it seems pretty > good, but there are a couple more tests we could do. Thank you for checking! > > - RT_KEY_GET_SHIFT is not covered for key=0: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > That should be fairly simple to add to the tests. There are two paths to call RT_KEY_GET_SHIFT(): 1. RT_SET() -> RT_KEY_GET_SHIFT() 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() In both cases, it's called when key > tree->ctl->max_val. Since the minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called when key=0. > > - Some paths for single-value leaves are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 > > However, these paths do get regression test coverage on 32-bit > machines. 64-bit builds only have leaves in the TID store, which > doesn't (currently) delete entries, and doesn't instantiate the tree > with the debug option. Right. > > - In RT_SET "if (found)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > That's because we don't yet have code that replaces an existing value > with a value of a different length. Noah reported an issue around that. We should incorporate the patch and cover this code path. > > - RT_FREE_RECURSE isn't well covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > The TID store test is pretty simple as far as distribution of block > keys, and focuses more on the offset bitmaps. We could try to cover > all branches here, but it would make the test less readable, and it's > kind of the wrong place to do that anyway. test_radixtree.c does have > a commented-out option to use shared memory, but that's for local > testing and won't be reflected in the coverage report. Maybe it's > enough. Agreed. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > That should be easy to add. Agreed. The patch is attached. > > - RT_DUMP_NODE is not covered, and never called by default anyway: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804 > > It seems we could just leave it alone since it's debug-only, but it's > also a lot of lines. One idea is to use elog with DEBUG5 instead of > commenting out the call sites, but that would cause a lot of noise. I think we can leave it alone. > > - TidStoreCreate* has some memory clamps that are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > Maybe we could experiment with using 1MB for shared, and something > smaller for local. I've confirmed that the local and shared tidstore with small max sizes such as 4kB and 1MB worked. Currently the max size is hard-coded in test_tidstore.c but if we use work_mem as the max size, we can pass different max sizes for local and shared in the test script. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote: > > - Some paths for single-value leaves are not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 > > > > However, these paths do get regression test coverage on 32-bit > > machines. 64-bit builds only have leaves in the TID store, which > > doesn't (currently) delete entries, and doesn't instantiate the tree > > with the debug option. > > > > - In RT_SET "if (found)" is not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > > > That's because we don't yet have code that replaces an existing value > > with a value of a different length. > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > Patch attached. Great find, thank you for the patch! The fix looks good to me. I think we can improve regression tests for better coverage. In TidStore on a 64-bit machine, we can store 3 offsets in the header and these values are embedded to the leaf page. With more than 3 offsets, the value size becomes more than 16 bytes and a single value leaf. Therefore, if we can add the test with the array[1,2,3,4,100], we can cover the case of replacing a single-value leaf with a different size new single-value leaf. Now we add 9 pairs of do_gset_block_offset() and check_set_block_offsets(). If these are annoying, we can remove the cases of array[1] and array[1,2]. I've attached a new patch. In addition to the new test case I mentioned, I've added some new comments and removed an unnecessary added line in test_tidstore.sql. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
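In TID-store terms, the branch Noah's fix covers is reached whenever offsets are set again for a block that already has an entry and the new value has a different size. A minimal C-level sketch of that situation (the regression test drives it through do_set_block_offsets(); 'ts' here is assumed to be an already-created local TidStore):

	/* four offsets no longer fit in the embedded header on 64-bit builds,
	 * so this value becomes a single-value leaf */
	OffsetNumber first[] = {1, 2, 3, 4};

	/* offset 100 needs a wider bitmap, so the replacement value has a
	 * different length */
	OffsetNumber second[] = {1, 2, 3, 4, 100};

	TidStoreSetBlockOffsets(ts, (BlockNumber) 1, first, lengthof(first));
	/* same block again: RT_SET takes the previously uncovered "found" path */
	TidStoreSetBlockOffsets(ts, (BlockNumber) 1, second, lengthof(second));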
On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > > Patch attached. > > Great find, thank you for the patch! +1 (This occurred to me a few days ago, but I was far from my computer.) With the purge function that Noah proposed, I believe we can also get rid of the comment at the top of the .sql test file warning of a maintenance hazard: ..."To avoid adding duplicates, -- each call to do_set_block_offsets() should use different block -- numbers." I found that it doesn't add any measurable time to run the test. > The fix looks good to me. I think we can improve regression tests for > better coverage. In TidStore on a 64-bit machine, we can store 3 > offsets in the header and these values are embedded to the leaf page. > With more than 3 offsets, the value size becomes more than 16 bytes > and a single value leaf. Therefore, if we can add the test with the > array[1,2,3,4,100], we can cover the case of replacing a single-value > leaf with a different size new single-value leaf. Now we add 9 pairs Good idea. > of do_gset_block_offset() and check_set_block_offsets(). If these are > annoying, we can remove the cases of array[1] and array[1,2]. Let's keep those -- 32-bit platforms should also exercise this path.
On Thu, Apr 25, 2024 at 12:17 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > > > Patch attached. > > > > Great find, thank you for the patch! > > +1 > > (This occurred to me a few days ago, but I was far from my computer.) > > With the purge function that Noah proposed, I believe we can also get > rid of the comment at the top of the .sql test file warning of a > maintenance hazard: > ..."To avoid adding duplicates, > -- each call to do_set_block_offsets() should use different block > -- numbers." Good point. Removed. > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > annoying, we can remove the cases of array[1] and array[1,2]. > > Let's keep those -- 32-bit platforms should also exercise this path. Agreed. I've attached a new patch. I'll push it tonight, if there is no further comment. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Apr 25, 2024 at 12:17 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > > > > Patch attached. > > > > > > Great find, thank you for the patch! > > > > +1 > > > > (This occurred to me a few days ago, but I was far from my computer.) > > > > With the purge function that Noah proposed, I believe we can also get > > rid of the comment at the top of the .sql test file warning of a > > maintenance hazard: > > ..."To avoid adding duplicates, > > -- each call to do_set_block_offsets() should use different block > > -- numbers." > > Good point. Removed. > > > > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > > annoying, we can remove the cases of array[1] and array[1,2]. > > > > Let's keep those -- 32-bit platforms should also exercise this path. > > Agreed. > > I've attached a new patch. I'll push it tonight, if there is no further comment. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote: > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > That should be fairly simple to add to the tests. > > There are two paths to call RT_KEY_GET_SHIFT(): > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > In both cases, it's called when key > tree->ctl->max_val. Since the > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > when key=0. Ah, right, so it is dead code. Nothing to worry about, but it does point the way to some simplifications, which I've put together in the attached. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > That should be easy to add. > > Agreed. The patch is attached. LGTM > > - TidStoreCreate* has some memory clamps that are not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > Maybe we could experiment with using 1MB for shared, and something > > smaller for local. > > I've confirmed that the local and shared tidstore with small max sizes > such as 4kB and 1MB worked. Currently the max size is hard-coded in > test_tidstore.c but if we use work_mem as the max size, we can pass > different max sizes for local and shared in the test script. Seems okay, do you want to try that and see how it looks?
Attachment
On Wed, May 1, 2024 at 4:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > > > That should be fairly simple to add to the tests. > > > > There are two paths to call RT_KEY_GET_SHIFT(): > > > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > > > In both cases, it's called when key > tree->ctl->max_val. Since the > > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > > when key=0. > > Ah, right, so it is dead code. Nothing to worry about, but it does > point the way to some simplifications, which I've put together in the > attached. Thank you for the patch. It looks good to me. + /* compute the smallest shift that will allowing storing the key */ + start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN; The comment is moved from RT_KEY_GET_SHIFT() but I think s/will allowing storing/will allow storing/. > > > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > > > That should be easy to add. > > > > Agreed. The patch is attached. > > LGTM > > > > - TidStoreCreate* has some memory clamps that are not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > > > Maybe we could experiment with using 1MB for shared, and something > > > smaller for local. > > > > I've confirmed that the local and shared tidstore with small max sizes > > such as 4kB and 1MB worked. Currently the max size is hard-coded in > > test_tidstore.c but if we use work_mem as the max size, we can pass > > different max sizes for local and shared in the test script. > > Seems okay, do you want to try that and see how it looks? I've attached a simple patch for this. In test_tidstore.sql, we used to create two local tidstore and one shared tidstore. I thought of specifying small work_mem values for these three cases but it would remove the normal test cases. So I created separate tidstore for this test. Also, the new test is just to check if tidstore can be created with such a small size, but it might be a good idea to add some TIDs to check if it really works fine. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
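A sketch of what "use work_mem as the max size" could look like inside the test module; this is hypothetical, and only the source of the byte budget changes, leaving whatever arguments test_tidstore.c already passes to the create functions alone:

	/* work_mem is an int measured in kilobytes; convert to bytes once and
	 * let the SQL script choose different budgets for the local and shared
	 * cases simply by changing the work_mem setting */
	size_t		max_bytes = (size_t) work_mem * 1024;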