Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoBjmj0fDyTJCYvs3W2GQb9bUT5pwQLOU_=BUTOWhj9ToQ@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 26, 2023 at 3:54 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
>
> > I think that we need to prevent concurrent updates (RT_SET() and
> > RT_DELETE()) during the iteration to get the consistent result through
> > the whole iteration operation. Unlike other operations such as
> > RT_SET(), we cannot expect that a job doing something for each
> > key-value pair in the radix tree completes in a short time, so we
> > cannot keep holding the radix tree lock until the end of the
> > iteration.
>
> This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.
>
> The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.

Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on an LWLock, which is not
interruptible.

>
> > So the idea is that we set iter_active to true (with the
> > lock in exclusive mode), and prevent concurrent updates when the flag
> > is true.
>
> ...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.

Right. If we want to wait rather than raise an ERROR, the waiter
should wait in an interruptible way, for example on a condition
variable. I took a simpler approach in the v22 patch.

...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding an LWLock on the partition. My assumption might be wrong.
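To make the interruptible-wait idea concrete, here is a minimal sketch of a writer sleeping on a condition variable until an in-progress iteration finishes. This is not the patch's code: pthreads stand in for PostgreSQL's LWLock and ConditionVariable, and all names are invented for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical control struct guarding a radix tree's iteration state. */
typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  iter_done;
    bool            iter_active;
} rt_control;

static void
rt_begin_iterate(rt_control *ctl)
{
    pthread_mutex_lock(&ctl->mutex);
    ctl->iter_active = true;    /* blocks concurrent RT_SET/RT_DELETE */
    pthread_mutex_unlock(&ctl->mutex);
}

static void
rt_end_iterate(rt_control *ctl)
{
    pthread_mutex_lock(&ctl->mutex);
    ctl->iter_active = false;
    pthread_cond_broadcast(&ctl->iter_done);    /* wake waiting writers */
    pthread_mutex_unlock(&ctl->mutex);
}

/* A writer calls this before modifying the tree. In a real backend the
 * loop would also check for interrupts each time it wakes, which is the
 * point of using a condition variable over an uninterruptible lock wait. */
static void
rt_wait_for_iteration(rt_control *ctl)
{
    pthread_mutex_lock(&ctl->mutex);
    while (ctl->iter_active)
        pthread_cond_wait(&ctl->iter_done, &ctl->mutex);
    pthread_mutex_unlock(&ctl->mutex);
}
```

The key difference from waiting on an LWLock is that each wakeup returns control to the waiter, giving it a chance to service interrupts before re-checking the flag.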

>
> > > Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
> >
> > The lock I'm thinking of adding is a simple readers-writer lock. This
> > lock is used for concurrent radix tree operations except for the
> > iteration. For operations concurrent to the iteration, I used a flag
> > for the reason I mentioned above.
>
> This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.
>
> + * WIP: describe about how locking works.
>
> A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.
>
> [thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.

That seems like a good idea. Given the use case of parallel heap
vacuum, it would be good to support multiple read-only iterators. The
iteration in the v22 patch is read-only, so to support a read/write
iterator we would need a function that modifies the current key-value
pair returned by the iteration.

>
> > > Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
> >
> > Using the isolation tester to test locking seems like a good idea. We
> > can include it in test_radixtree. But given that the locking in the
> > radix tree is very simple, the test case would be very simple. It may
> > be controversial whether it's worth adding such testing by adding both
> > the new test module and test cases.
>
> I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.

Okay, understood.

>
> > I think the user (e.g, vacuumlazy.c) can pass the maximum offset
> > number to the parallel vacuum.
>
> Okay, sounds good.
>
> Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:
>
> 0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12.

Agreed with the attached patch.
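For readers following along, the pattern being removed looks roughly like this. The structure and names here are invented for illustration and do not come from the patch: instead of a local child variable that was memset to zero just to silence the compiler and then copied out, the search writes straight through the caller-supplied pointer, and the caller initializes its own variable.

```c
#include <stdbool.h>
#include <stddef.h>

#define RT_FANOUT 16

/* Toy stand-in for a radix tree node. */
typedef struct rt_node rt_node;
struct rt_node
{
    rt_node *slots[RT_FANOUT];
};

/* Writes the found child directly through child_p; on a miss, *child_p
 * is left untouched, so callers must initialize it themselves. */
static bool
rt_node_search_child(rt_node *node, int chunk, rt_node **child_p)
{
    if (node->slots[chunk] == NULL)
        return false;
    *child_p = node->slots[chunk];    /* write directly, no local copy */
    return true;
}
```

This trades a tiny caller-side initialization burden for removing a local variable (and its warning-suppressing memset) from the template entirely.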

>  And I will work later on making "value" in the public API a pointer.

Thanks!

>
> 0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
>
> + if (TidStoreIsShared(ts))
> + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> + else
> + return sizeof(TidStore) + sizeof(TidStore) +
> + local_rt_memory_usage(ts->tree.local);
>
> There is repetition in the else branch.

Agreed, will remove.
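For clarity, the corrected branch would charge sizeof(TidStore) only once in both arms. The surrounding types and the radix-tree memory-usage calls are stubbed here so the function compiles standalone; only the shape of TidStoreMemoryUsage reflects the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stubs so the example is self-contained; not the real definitions. */
typedef struct { size_t mem_used; } shared_radix_tree;
typedef struct { size_t mem_used; } local_radix_tree;

typedef struct TidStore
{
    bool is_shared;
    union
    {
        shared_radix_tree *shared;
        local_radix_tree  *local;
    } tree;
} TidStore;

static bool TidStoreIsShared(TidStore *ts) { return ts->is_shared; }
static size_t shared_rt_memory_usage(shared_radix_tree *t) { return t->mem_used; }
static size_t local_rt_memory_usage(local_radix_tree *t) { return t->mem_used; }

/* The else branch now adds sizeof(TidStore) exactly once. */
static size_t
TidStoreMemoryUsage(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
    else
        return sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
}
```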

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


