Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: John Naylor
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CANWCAZY014Nvdi4YZciJ7h0vXXH=6VkN66W-oiTqc0OziXzMEQ@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum
List: pgsql-hackers

On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > Random offsets is what I was thinking of (if made distinct and
> > ordered), but even there the code is fairy trivial, so I don't have a
> > strong feeling about it.
>
> Agreed.

Looks good.

A related thing I should mention is that the tests which look up all
possible offsets are really expensive with the number of blocks we're
using now (assert build):

v70 0.33s
v72 1.15s
v73 1.32s

To trim that back, I think we should give up on using shared memory
for the is-full test: We can cause aset to malloc a new block with a
lot fewer entries. In the attached, this brings it back down to 0.43s.
It might also be worth reducing the number of blocks in the random
test -- multiple runs will have different offsets anyway.

> > I think we can stop including the debug-tid-store patch for CI now.
> > That would allow getting rid of some unnecessary variables.
>
> Agreed.

Okay, all that remains here is to get rid of those variables (might be
just one).

> > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> > + * in one block. We return the block numbers in ascending order and the offset
> > + * numbers in each result is also sorted in ascending order.
> > + */
> > +TidStoreIterResult *
> > +TidStoreIterateNext(TidStoreIter *iter)
> >
> > The wording is a bit awkward.
>
> Fixed.

- * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
- * in one block. We return the block numbers in ascending order and the offset
- * numbers in each result is also sorted in ascending order.
+ * Scan the TidStore and return the TIDs of the next block. The returned block
+ * numbers is sorted in ascending order, and the offset numbers in each result
+ * is also sorted in ascending order.

Better, but it's still not very clear. Maybe "The offsets in each
iteration result are ordered, as are the block numbers over all
iterations."
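To make that concrete, here's an untested sketch of a caller loop,
assuming the v73 names as I read them (TidStoreBeginIterate /
TidStoreIterateNext / TidStoreEndIterate, and the blkno / num_offsets /
offsets fields of TidStoreIterResult):

#include "postgres.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

/* sketch only -- not part of the patch series */
static void
consume_tids(TidStore *ts)
{
	TidStoreIter *iter = TidStoreBeginIterate(ts);
	TidStoreIterResult *result;

	/* Blocks come back in ascending blkno order... */
	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		/* ...and within each result the offsets are ascending too. */
		for (int i = 0; i < result->num_offsets; i++)
		{
			ItemPointerData tid;

			ItemPointerSet(&tid, result->blkno, result->offsets[i]);
			/* do something with the TID */
		}
	}

	TidStoreEndIterate(iter);
}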
> > +/* Extract TIDs from the given key-value pair */
> > +static void
> > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > BlocktableEntry *page)
> >
> > This is a leftover from the old encoding scheme. This should really
> > take a "BlockNumber blockno" not a "key", and the only call site
> > should probably cast the uint64 to BlockNumber.
>
> Fixed.

This part looks good. I didn't notice earlier, but this comment has a
similar issue:

@@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
 		return NULL;

 	/* Collect TIDs extracted from the key-value pair */
-	tidstore_iter_extract_tids(iter, key, page);
+	tidstore_iter_extract_tids(iter, (BlockNumber) key, page);

..."extracted" was once a separate operation. I think just removing
that one word is enough to update it.

Some other review on code comments:

v73-0001:

+	/* Enlarge the TID array if necessary */

It's "arrays" now.

v73-0005:

+-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
+-- different randon 100 offset numbers each other.

The numbers are obvious from the query. Maybe just mention that the
offsets are randomized and must be unique and ordered.

+ * The caller is responsible for release any locks.

"releasing"

> > +typedef struct BlocktableEntry
> > +{
> > +	uint16		nwords;
> > +	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
> > +} BlocktableEntry;
> >
> > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.

I should be more clear here: nwords fitting into one byte allows 3
embedded offsets (1 on 32-bit platforms, which is good for testing at
least). With uint16 nwords that reduces to 2 (none on 32-bit
platforms).

Further, after the current patch series is fully committed, I plan to
split the embedded-offset patch into two parts: The first would store
the offsets in the header, but would still need a (smaller) allocation.
The second would embed them in the child pointer. Only the second patch
will care about the size of nwords because it needs to reserve a byte
for the pointer tag.

> > That doesn't have any real-world affect on the largest offset
> > encountered, and only in 32-bit builds with 32kB block size would the
> > theoretical max change at all. To be precise, we could use in the
> > MaxBlocktableEntrySize calculation:
> >
> > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
>
> I don't get this expression. Making the nwords one byte works well?
> With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> OS, respectively. One byte nwrods variable seems not to be sufficient

I believe there is confusion between bitmap words and bytes:

2048 / 64 = 32 words = 256 bytes

It used to be max tuples per (heap) page, but we wanted a simple way
to make this independent of heap.

I believe we won't need to ever store the actual MaxOffsetNumber,
although we technically still could with a one-byte type and 32kB
pages, at least on 64-bit platforms.

> for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
> PG_INT8_MAX - 1" come from?

127 words, each with 64 (or 32) bits. The zero bit is not a valid
offset, so subtract one. And I used signed type in case there was a
need for -1 to mean something.
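In case it helps, here is the arithmetic spelled out as a standalone
back-of-envelope check (not patch code; 64 stands in for
BITS_PER_BITMAPWORD on a 64-bit build and 127 for PG_INT8_MAX):

#include <stdio.h>

int
main(void)
{
	int			bits_per_word = 64; /* 32 on 32-bit builds */
	int			max_offset = 2048;	/* MaxOffsetNumber for 8kB pages */

	/* 2048 / 64 = 32 words = 256 bytes -- nowhere near 256 *words* */
	int			nwords = max_offset / bits_per_word;

	printf("%d words, %d bytes of bitmap\n",
		   nwords, nwords * (bits_per_word / 8));

	/*
	 * Ceiling from a signed one-byte nwords: 127 words of 64 (or 32)
	 * bits each, minus one because offset zero is not a valid offset.
	 */
	printf("largest representable offset: %d\n", bits_per_word * 127 - 1);

	return 0;
}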