Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CANWCAZbO6ZOEqbD=OJqt=diVqXJGiUHfG8EYcJ2zR4fCnfrH_w@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 9:59 AM John Naylor <johncnaylorls@gmail.com> wrote:
> > > BTW do we still want to test the tidstore by using a combination of
> > > SQL functions? We might no longer need to input TIDs via a SQL
> > > function.
> >
> > I'm not sure. I stopped short of doing that to get feedback on this
> > much. One advantage with SQL functions is we can use generate_series
> > to easily input lists of blocks with different numbers and strides,
> > and array literals for offsets are a bit easier. What do you think?
>
> While I'm not a fan of the following part, I agree that it makes sense
> to use SQL functions for test data generation:
>
> -- Constant values used in the tests.
> \set maxblkno 4294967295
> -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
> -- We use a higher number to test tidstore.
> \set maxoffset 512

I'm not really a fan of these either, and could be removed a some
point if we've done everything else nicely.

> It would also be easier for developers to test the tidstore with their
> own data set. So I agreed with the current approach; use SQL functions
> for data generation and do the actual tests inside C functions.

Okay, here's an another idea: Change test_lookup_tids() to be more
general and put the validation down into C as well. First we save the
blocks from do_set_block_offsets() into a table, then with all those
blocks lookup a sufficiently-large range of possible offsets and save
found values in another array. So the static items structure would
have 3 arrays: inserts, successful lookups, and iteration (currently
the iteration output is private to check_set_block_offsets(). Then
sort as needed and check they are all the same.

Further thought: We may not really need to test block numbers that
vigorously, since the radix tree tests should cover keys/values pretty
well. The difference here is using bitmaps of tids and that should be
well covered.

Locally (not CI), we should try big inputs to make sure we can
actually go up to many GB -- it's easier and faster this way than
having vacuum give us a large data set.

> Is it
> convenient for developers if we have functions like generate_tids()
> and generate_random_tids() to generate TIDs so that they can pass them
> to do_set_block_offsets()?

I guess I don't see the advantage of adding a layer of indirection at
this point, but it could be useful at a later time.



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Typos in reorderbuffer.c.
Next
From: Dilip Kumar
Date:
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring