Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CANWCAZbumguvp93hZhjGW16=Kf20tHJ0pyEXEFdZuCS0m7bcEQ@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
Re: [PoC] Improve dead tuple storage for lazy vacuum |
List | pgsql-hackers |
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor <johncnaylorls@gmail.com> wrote: > I've also run the same scripts in my environment just in case and got > similar results: Thanks for testing, looks good as well. > > There are still some micro-benchmarks we could do on tidstore, and > > it'd be good to find out worse-case memory use (1 dead tuple each on > > spread-out pages), but this is decent demonstration. > > I've tested a simple case where vacuum removes 33k dead tuples spread > about every 10 pages. > > master: > 198,000 bytes (=33000 * 6) > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > v-59: > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s The memory usage for the sparse case may be a concern, although it's not bad -- a multiple of something small is probably not huge in practice. See below for an option we have for this. > > > > I'm pretty sure there's an > > > > accidental memset call that crept in there, but I'm running out of > > > > steam today. > > > > I have just a little bit of work to add for v59: > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > and there are none between 64 and 128, so not a huge deal but I think > > it's a bit nicer in this patch. > > LGTM. Okay, I've squashed this. > I've drafted the commit message. Thanks, this is a good start. > I've run regression tests with valgrind and run the coverity scan, and > I don't see critical issues. Great! Now, I think we're in pretty good shape. There are a couple of things that might be objectionable, so I want to try to improve them in the little time we have: 1. Memory use for the sparse case. I shared an idea a few months ago of how runtime-embeddable values (true combined pointer-value slots) could work for tids. I don't think this is a must-have, but it's not a lot of code, and I have this working: v61-0006: Preparatory refactoring -- I think we should do this anyway, since the intent seems more clear to me. v61-0007: Runtime-embeddable tids -- Optional for v17, but should reduce memory regressions, so should be considered. Up to 3 tids can be stored in the last level child pointer. It's not polished, but I'll only proceed with that if we think we need this. "flags" iis called that because it could hold tidbitmap.c booleans (recheck, lossy) in the future, in addition to reserving space for the pointer tag. Note: I hacked the tests to only have 2 offsets per block to demo, but of course both paths should be tested. 2. Management of memory contexts. It's pretty verbose and messy. I think the abstraction could be better: A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we can't destroy or reset it. That means we have to do a lot of manual work. B: Passing "max_bytes" to the radix tree was my idea, I believe, but it seems the wrong responsibility. Not all uses will have a work_mem-type limit, I'm guessing. We only use it for limiting the max block size, and aset's default 8MB is already plenty small for vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so smaller, and there it makes sense to limit the max blocksize this way. C: The context for values has complex #ifdefs based on the value length/varlen, but it's both too much and not enough. If we get a bump context, how would we shoehorn that in for values for vacuum but not for tidbitmap? Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to TidStoreCreate(), and then to RT_CREATE. That context will contain the values (for local mem), and the node slabs will be children of the value context. That way, measuring memory usage and free-ing can just call with this parent context, and let recursion handle the rest. Perhaps the passed context can also hold the radix-tree struct, but I'm not sure since I haven't tried it. What do you think? With this resolved, I think the radix tree is pretty close to committable. The tid store will likely need some polish yet, but no major issues I know of. (And, finally, a small thing I that I wanted to share just so I don't forget, but maybe not worth the attention: In Andres's prototype, there is a comment wondering if an update can skip checking if it first need to create a root node. This is pretty easy, and done in v61-0008.)
Attachment
pgsql-hackers by date: