Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CAD21AoCKBiTr6Voy20qJ09j5DZa0cOuc=VOzwm-sCLK7PDG8xg@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <johncnaylorls@gmail.com>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
|
List | pgsql-hackers |
On Fri, Feb 2, 2024 at 8:47 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the new patch set (v56). I've squashed previous updates > > and addressed review comments on v55 in separate patches. Here are the > > update summary: > > > > 0004: fix compiler warning caught by ci test. > > 0005-0008: address review comments on radix tree codes. > > 0009: cleanup #define and #undef > > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is > > undefined after including radixtree.h so we should not use it in test > > code. > > Great, thanks! > > I have a few questions and comments on v56, then I'll address yours > below with the attached v57, which is mostly cosmetic adjustments. Thank you for the comments! I've squashed previous updates and your changes. > > v56-0003: > > (Looking closer at tests) > > +static const bool rt_test_stats = false; > > I'm thinking we should just remove everything that depends on this, > and keep this module entirely about correctness. Agreed. Removed in 0006 patch. > > + for (int shift = 0; shift <= (64 - 8); shift += 8) > + test_node_types(shift); > > I'm not sure what the test_node_types_* functions are testing that > test_basic doesn't. They have a different, and confusing, way to stop > at every size class and check the keys/values. It seems we can replace > all that with two more calls (asc/desc) to test_basic, with the > maximum level. Agreed, addressed in 0007 patch. > > It's pretty hard to see what test_pattern() is doing, or why it's > useful. I wonder if instead the test could use something like the > benchmark where random integers are masked off. That seems simpler. I > can work on that, but I'd like to hear your side about test_pattern(). Yeah, test_pattern() is originally created for the integerset so it doesn't necessarily fit the radixtree. I agree to use some tests from benchmarks. > > v56-0007: > > + * > + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory, > + * the caller doesn't need to take a lock. > > Maybe something like "Since dsa_get_total_size() does appropriate locking ..."? Agreed. Fixed in 0005 patch. > > v56-0008 > > Thanks, I like how the tests look now. > > -NOTICE: testing node 4 with height 0 and ascending keys > ... > +NOTICE: testing node 1 with height 0 and ascending keys > > Now that the number is not intended to match a size class, "node X" > seems out of place. Maybe we could have a separate array with strings? > > + 1, /* RT_CLASS_4 */ > > This should be more than one, so that the basic test still exercises > paths that shift elements around. > > + 100, /* RT_CLASS_48 */ > > This node currently holds 64 for local memory. > > + 255 /* RT_CLASS_256 */ > > This is the only one where we know exactly how many it can take, so > may as well keep it at 256. Fixed in 0008 patch. > > v56-0012: > > The test module for tidstore could use a few more comments. Addressed in 0012 patch. > > v56-0015: > > +typedef dsa_pointer TidStoreHandle; > + > > -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) > +TidStoreAttach(dsa_area *area, TidStoreHandle handle) > { > TidStore *ts; > + dsa_pointer rt_dp = handle; > > My earlier opinion was that "handle" was a nicer variable name, but > this brings back the typedef and also keeps the variable name I didn't > like, but pushes it down into the function. I'm a bit confused, so > I've kept these not-squashed for now. I misunderstood your comment. I've changed to use a variable name rt_handle and removed the TidStoreHandle type. 0013 patch. > > ----------------------------------------------------------------------------------- > > Now, for v57: > > > Looking at overall changes, there are still XXX and TODO comments in > > radixtree.h: > > That's fine, as long as it's intentional as a message to readers. That > said, we can get rid of some: > > > --- > > * XXX There are 4 node kinds, and this should never be increased, > > * for several reasons: > > * 1. With 5 or more kinds, gcc tends to use a jump table for switch > > * statements. > > * 2. The 4 kinds can be represented with 2 bits, so we have the option > > * in the future to tag the node pointer with the kind, even on > > * platforms with 32-bit pointers. This might speed up node traversal > > * in trees with highly random node kinds. > > * 3. We can have multiple size classes per node kind. > > > > Can we just remove "XXX"? > > How about "NOTE"? Agreed. > > > --- > > * WIP: notes about traditional radix tree trading off span vs height... > > > > Are you going to write it? > > Yes, when I draft a rough commit message, (for next time). Thanks! > > > --- > > #ifdef RT_SHMEM > > /* WIP: do we really need this? */ > > typedef dsa_pointer RT_HANDLE; > > #endif > > > > I think it's worth having it. > > Okay, removed WIP in v57-0004. > > > --- > > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit > > * inside a single bitmapword on most platforms, so it's a good starting > > * point. We can make it higher if we need to. > > */ > > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4) > > > > Are you going to work something on this? > > Hard-coded 64 for readability, and changed this paragraph to explain > the current rationale more clearly: > > "The paper uses at most 64 for this node kind, and one advantage for us > is that "isset" is a single bitmapword on most platforms, rather than > an array, allowing the compiler to get rid of loops." LGTM. > > > --- > > /* WIP: We could go first to the higher node16 size class */ > > newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO); > > > > Does it mean to go to RT_CLASS_16_HI and then further go to > > RT_CLASS_16_LO upon further deletion? > > Yes. It wouldn't be much work to make shrinking symmetrical with > growing (a good thing), but it's not essential so I haven't done it > yet. Okay, let's keep it as WIP. > > > --- > > * TODO: The current locking mechanism is not optimized for high concurrency > > * with mixed read-write workloads. In the future it might be worthwhile > > * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in > > * the paper "The ART of Practical Synchronization" by the same authors as > > * the ART paper, 2016. > > > > I think it's not TODO for now, but a future improvement. We can remove it. > > It _is_ a TODO, regardless of when it happens. Understood. > > > --- > > /* TODO: consider 5 with subclass 1 or 2. */ > > #define RT_FANOUT_4 4 > > > > Is there something we need to do here? > > Changed to: > > "To save memory in trees with sparse keys, it would make sense to have two > size classes for the smallest kind (perhaps a high class of 5 and a low class > of 2), but it would be more effective to utilize lazy expansion and > path compression." LGTM. But there is an extra '*' in the last line: + /* : + * of 2), but it would be more effective to utilize lazy expansion and + * path compression. + * */ Fixed in 0004 patch. > > > --- > > /* > > * Return index of the chunk and slot arrays for inserting into the node, > > * such that the chunk array remains ordered. > > * TODO: Improve performance for non-SIMD platforms. > > */ > > > > Are you going to work on this? > > A small step in v57-0010. I've found a way to kill two birds with one > stone, by first checking for the case that the keys are inserted in > order. This also helps the SIMD case because it must branch anyway to > avoid bitscanning a zero bitfield. This moves the branch up and turns > a mask into an assert, looking a bit nicer. I've removed the TODO, but > maybe we should add it to the search_eq function. Great! > > > --- > > /* Delete the element at 'idx' */ > > /* TODO: replace slow memmove's */ > > > > Are you going to work on this? > > Done in v57-0011. LGTM. > > The rest: > v57-0004 - 0008 should be self explanatory, but questions/pushback welcome. > v57-0009 - I'm thinking leaves don't need to be memset at all. The > value written should be entirely the caller's responsibility, it > seems. > v57-0013 - the bench module can be built locally again > v57-0016 - minor comment edits in tid store These fixes look good to me. > > My todo: > - benchmark tid store / vacuum again, since we haven't since varlen > types and removing unnecessary locks. I'm pretty sure there's an > accidental memset call that crept in there, but I'm running out of > steam today. > - leftover comment etc work Thanks. I'm also going to do some benchmarks and tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: