Thread: Use read streams in pg_visibility
Hi, I am working on using the read stream in pg_visibility. There are two places to use it: 1- collect_visibility_data() This one is straightforward. I created a read stream object if 'include_pd' is true because read I/O is done when 'include_pd' is true. There is ~4% timing improvement with this change. I started the server with the default settings and created a 6 GB table. Then run 100 times pg_visibility() by clearing the OS cache between each run. ---------- 2- collect_corrupt_items() This one is more complicated. The read stream callback function loops until it finds a suitable block to read. So, if the callback returns an InvalidBlockNumber; it means that the stream processed all possible blocks and the stream should be finished. There is ~3% timing improvement with this change. I started the server with the default settings and created a 6 GB table. Then run 100 times pg_check_visible() by clearing the OS cache between each run. The downside of this approach is there are too many "vmbuffer is valid but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so vmbuffer needs to be read again" cases in the read stream version (700 vs 20 for the 6 GB table). This is caused by the callback function of the read stream reading a new vmbuf while getting next block numbers. So, vmbuf is wrong when we are checking visibility map bits that might have changed while we were acquiring the page lock. ---------- Both patches are attached. Any feedback would be appreciated. -- Regards, Nazir Bilal Yavuz Microsoft
Attachment
On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote: > 2- collect_corrupt_items() > > This one is more complicated. The read stream callback function loops > until it finds a suitable block to read. So, if the callback returns > an InvalidBlockNumber; it means that the stream processed all possible > blocks and the stream should be finished. There is ~3% timing > improvement with this change. I started the server with the default > settings and created a 6 GB table. Then run 100 times > pg_check_visible() by clearing the OS cache between each run. > > The downside of this approach is there are too many "vmbuffer is valid > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so > vmbuffer needs to be read again" cases in the read stream version (700 > vs 20 for the 6 GB table). This is caused by the callback function of > the read stream reading a new vmbuf while getting next block numbers. > So, vmbuf is wrong when we are checking visibility map bits that might > have changed while we were acquiring the page lock. Did the test that found 700 "read again" use a different procedure than the one shared as setup.sh down-thread? Using that script alone, none of the vm bits are set, so I'd not expect any heap page reads. The "vmbuffer needs to be read again" regression fits what I would expect the v1 patch to do with a table having many vm bits set. In general, I think the fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer while collect_corrupt_items() works on another vmbuffer. Much of the time, they'll be the same buffer. It could be as simple as that, but you could consider further optimizations like these: - Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping lookup when the other Buffer var already has the block you want. - [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c to stop calling the ReadStreamBlockNumberCB for awhile. This could help if nonzero vm bits are infrequent, causing us to visit few heap blocks per vm block. For example, if one block out of every 33000 is all-visible, every heap block we visit has a different vmbuffer. It's likely not optimal for the callback to pin and unpin 20 vmbuffers, then have collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility could notice this trend and request a stop of the callbacks until more of the heap block work completes. If pg_visibility is going to be the only place in the code with this use case, it's probably not worth carrying the extra API just for pg_visibility. However, if we get a stronger use case later, pg_visibility could be another beneficiary. > +/* > + * Callback function to get next block for read stream object used in > + * collect_visibility_data() function. > + */ > +static BlockNumber > +collect_visibility_data_read_stream_next_block(ReadStream *stream, > + void *callback_private_data, > + void *per_buffer_data) > +{ > + struct collect_visibility_data_read_stream_private *p = callback_private_data; > + > + if (p->blocknum < p->nblocks) > + return p->blocknum++; > + > + return InvalidBlockNumber; This is the third callback that just iterates over a block range, after pg_prewarm_read_stream_next_block() and copy_storage_using_buffer_read_stream_next_block(). While not a big problem, I think it's time to have a general-use callback for block range scans. The quantity of duplicate code is low, but the existing function names are long and less informative than a behavior-based name. > +static BlockNumber > +collect_corrupt_items_read_stream_next_block(ReadStream *stream, > + void *callback_private_data, > + void *per_buffer_data) > +{ > + struct collect_corrupt_items_read_stream_private *p = callback_private_data; > + > + for (; p->blocknum < p->nblocks; p->blocknum++) > + { > + bool check_frozen = false; > + bool check_visible = false; > + > + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer)) > + check_frozen = true; > + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer)) > + check_visible = true; > + if (!check_visible && !check_frozen) > + continue; If a vm has zero bits set, this loop will scan the entire vm before returning. Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS(). > @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) > items->count = 64; > items->tids = palloc(items->count * sizeof(ItemPointerData)); > > + p.blocknum = 0; > + p.nblocks = nblocks; > + p.rel = rel; > + p.vmbuffer = &vmbuffer; > + p.all_frozen = all_frozen; > + p.all_visible = all_visible; > + stream = read_stream_begin_relation(READ_STREAM_FULL, > + bstrategy, > + rel, > + MAIN_FORKNUM, > + collect_corrupt_items_read_stream_next_block, > + &p, > + 0); > + > /* Loop over every block in the relation. */ > for (blkno = 0; blkno < nblocks; ++blkno) The callback doesn't return blocks having zero vm bits, so the blkno variable is not accurate. I didn't test, but I think the loop's "Recheck to avoid returning spurious results." looks at the bit for the wrong block. If that's what v1 does, could you expand the test file to catch that? For example, make a two-block table with only the second block all-visible. Since the callback skips blocks, this function should use the read_stream_reset() style of looping: while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) ... Thanks, nm
On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote: > On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote: > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote: > > > The downside of this approach is there are too many "vmbuffer is valid > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so > > > vmbuffer needs to be read again" cases in the read stream version (700 > > > vs 20 for the 6 GB table). This is caused by the callback function of > > > the read stream reading a new vmbuf while getting next block numbers. > > > So, vmbuf is wrong when we are checking visibility map bits that might > > > have changed while we were acquiring the page lock. > > > > Did the test that found 700 "read again" use a different procedure than the > > one shared as setup.sh down-thread? Using that script alone, none of the vm > > bits are set, so I'd not expect any heap page reads. > > Yes, it is basically the same script but the query is 'SELECT > pg_check_visible('${TABLE}'::regclass);'. I gather the scripts deal exclusively in tables with no vm bits set, so pg_visibility did no heap page reads under those scripts. But the '700 "read again"' result suggests either I'm guessing wrong, or that result came from a different test procedure. Thoughts? > > The "vmbuffer needs to be read again" regression fits what I would expect the > > v1 patch to do with a table having many vm bits set. In general, I think the > > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer > > while collect_corrupt_items() works on another vmbuffer. Much of the time, > > they'll be the same buffer. It could be as simple as that, but you could > > consider further optimizations like these: > > Thanks for the suggestion. Keeping two vmbuffers lowered the count of > "read-again" cases to ~40. I ran the test again and the timing > improvement is ~5% now. > I think the number of "read-again" cases are low enough now. Agreed. The increase from 20 to 40 is probably an increase in buffer mapping lookups, not an increase in I/O. > Does creating something like > pg_general_read_stream_next_block() in read_stream code and exporting > it makes sense? To me, that name isn't conveying the function's behavior, and the words "pg_" and "general_" aren't adding information there. How about one of these names or similar: seq_read_stream_cb sequential_read_stream_cb block_range_read_stream_cb > > The callback doesn't return blocks having zero vm bits, so the blkno variable > > is not accurate. I didn't test, but I think the loop's "Recheck to avoid > > returning spurious results." looks at the bit for the wrong block. If that's > > what v1 does, could you expand the test file to catch that? For example, make > > a two-block table with only the second block all-visible. > > Yes, it was not accurate. I am getting blockno from the buffer now. I > checked and confirmed it is working as expected by manually logging > blocknos returned from the read stream. I am not sure how to add a > test case for this. VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular TID, even if rolled back, clears both vm bits for the TID's page. Past tests like that had instability problems. One cause is a concurrent session's XID or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may have been one of the countermeasures, but I don't remember clearly. Hence, please search the archives or the existing pg_visibility tests for how we dealt with that. It may not be problem for this particular test. > There is one behavioral change introduced with the patches. It could > happen when collect_corrupt_items() is called with both all_visible > and all_frozen being true. > -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns > false in callback. Callback returns this block number because > VM_ALL_FROZEN() is true. > -> At the /* Recheck to avoid returning spurious results. */ part, we > should only check VM_ALL_FROZEN() again as this was returning true in > the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE(). > VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now > (vice versa of callback). > -> We were skipping this block in the master but the patched version > does not skip that. > > Is this a problem? If this is a problem, a solution might be to > callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the > per_buffer_data. No, I don't consider that a problem. That's not making us do additional I/O, just testing more bits within the pages we're already loading. The difference in behavior is user-visible, but it's the same behavior change the user would get if the timing had been a bit different.
On Tue, Aug 27, 2024 at 10:49:19AM +0300, Nazir Bilal Yavuz wrote: > On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote: > > On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote: > > > On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote: > > > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote: > I liked the block_range_read_stream_cb. Attached patches for that > (first 3 patches). I chose an nblocks way instead of last_blocks in > the struct. To read blocks 10 and 11, I would expect to initialize the struct with one of: { .first=10, .nblocks=2 } { .first=10, .last_inclusive=11 } { .first=10, .last_exclusive=12 } With the patch's API, I would need {.first=10,.nblocks=12}. The struct field named "nblocks" behaves like a last_block_exclusive. Please either make the behavior an "nblocks" behavior or change the field name to replace the term "nblocks" with something matching the behavior. (I used longer field names in my examples here, to disambiguate those examples. It's okay if the final field names aren't those, as long as the field names and the behavior align.) > > > > The callback doesn't return blocks having zero vm bits, so the blkno variable > > > > is not accurate. I didn't test, but I think the loop's "Recheck to avoid > > > > returning spurious results." looks at the bit for the wrong block. If that's > > > > what v1 does, could you expand the test file to catch that? For example, make > > > > a two-block table with only the second block all-visible. > > > > > > Yes, it was not accurate. I am getting blockno from the buffer now. I > > > checked and confirmed it is working as expected by manually logging > > > blocknos returned from the read stream. I am not sure how to add a > > > test case for this. > > > > VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular > > TID, even if rolled back, clears both vm bits for the TID's page. Past tests > > like that had instability problems. One cause is a concurrent session's XID > > or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may > > have been one of the countermeasures, but I don't remember clearly. Hence, > > please search the archives or the existing pg_visibility tests for how we > > dealt with that. It may not be problem for this particular test. > > Thanks for the information, I will check these. What I still do not > understand is how to make sure that only the second block is processed > and the first one is skipped. pg_check_visible() and pg_check_frozen() > returns TIDs that cause corruption in the visibility map, there is no > information about block numbers. I see what you're saying. collect_corrupt_items() needs a corrupt table to report anything; all corruption-free tables get the same output. Testing this would need extra C code or techniques like corrupt_page_checksum() to create the corrupt state. That wouldn't be a bad thing to have, but it's big enough that I'll consider it out of scope for $SUBJECT. With the callback change above, I'll be ready to push all this.
On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote: > Thanks, updated patches are attached. > +/* > + * Ask the callback which block it would like us to read next, with a small > + * buffer in front to allow read_stream_unget_block() to work and to allow the > + * fast path to skip this function and work directly from the array. > */ > static inline BlockNumber > read_stream_get_block(ReadStream *stream, void *per_buffer_data) v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced this update to the read_stream_get_block() header comment, but we're not changing read_stream_get_block() here. I reverted this. Pushed with some other cosmetic changes.
On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote: > On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote: > > Thanks, updated patches are attached. > > > +/* > > + * Ask the callback which block it would like us to read next, with a small > > + * buffer in front to allow read_stream_unget_block() to work and to allow the > > + * fast path to skip this function and work directly from the array. > > */ > > static inline BlockNumber > > read_stream_get_block(ReadStream *stream, void *per_buffer_data) > > v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced > this update to the read_stream_get_block() header comment, but we're not > changing read_stream_get_block() here. I reverted this. > > Pushed with some other cosmetic changes. I see I pushed something unacceptable under ASAN. I will look into that: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20
On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote: > On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote: > > On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote: > > > Pushed with some other cosmetic changes. > > Thanks! > > > I see I pushed something unacceptable under ASAN. I will look into that: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20 > > I think it is related to the scope of BlockRangeReadStreamPrivate in > the collect_visibility_data() function. Attached a small fix for that. Right. https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com then observed that collect_corrupt_items() was now guaranteed to never detect corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For the next try, could you add that testing we discussed?
Hi, On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote: > > https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com > then observed that collect_corrupt_items() was now guaranteed to never detect > corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For > the next try, could you add that testing we discussed? Do you think that corrupting the visibility map and then seeing if pg_check_visible() and pg_check_frozen() report something is enough? -- Regards, Nazir Bilal Yavuz Microsoft
On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote: > On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote: > > https://postgr.es/m/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com > > then observed that collect_corrupt_items() was now guaranteed to never detect > > corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For > > the next try, could you add that testing we discussed? > > Do you think that corrupting the visibility map and then seeing if > pg_check_visible() and pg_check_frozen() report something is enough? I think so. Please check that it would have caught both the blkno bug and the above bug.
On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote: > Your patch is correct. I wrongly assumed it would catch blockno bug, > the attached version catches it. I made blockno = 0 invisible and not > frozen before copying the vm file. So, in the blockno buggy version; > callback will skip that block but the main loop in the > collect_corrupt_items() will not skip it. I tested it with your patch > and there is exactly 1 blockno difference between expected and result > output. Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take a snapshot that blocks VACUUM updating the vismap. I doubt that could happen under default settings, but this lets us disregard the possibility entirely. I also fixed the mix of tabs and spaces inside test file string literals.
Hi, On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote: > > On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote: > > Your patch is correct. I wrongly assumed it would catch blockno bug, > > the attached version catches it. I made blockno = 0 invisible and not > > frozen before copying the vm file. So, in the blockno buggy version; > > callback will skip that block but the main loop in the > > collect_corrupt_items() will not skip it. I tested it with your patch > > and there is exactly 1 blockno difference between expected and result > > output. > > Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take > a snapshot that blocks VACUUM updating the vismap. I doubt that could happen > under default settings, but this lets us disregard the possibility entirely. Thanks! > I also fixed the mix of tabs and spaces inside test file string literals. I ran both pgindent and pgperltidy but they didn't catch them. Is there an automated way to catch them? -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Sep 11, 2024 at 09:19:09AM +0300, Nazir Bilal Yavuz wrote: > On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote: > > I also fixed the mix of tabs and spaces inside test file string literals. > > I ran both pgindent and pgperltidy but they didn't catch them. Is > there an automated way to catch them? git diff --check