Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+TgmoZdQGv35Qzrrj+-X_GypCPqv0FqfpQLmUfva1rY75hCfw@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Jul 10, 2019 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Broadly, you are correct to point out that you need to avoid chasing
> stale pointers, and there are a bunch of ways to accomplish that:
> approach #1 avoids using real pointers, and the rest just make sure
> that any stale pointers don't stick around long enough to cause any
> harm. There are probably also several other totally realistic
> alternatives, and I don't know for sure what is best, or how much it
> matters.

After some off-list discussion with Andres ...

Another possible approach here, which I think I like better, is to
switch from using a binary heap to using an rbtree.  That wouldn't
work well in DSM because of the way it uses pointers, but here we're
putting data in the shared memory segment so it seems like it should
work.  The idea would be to allocate an array of entries with a
freelist, and then have allocfunc and freefunc defined to push and pop
the freelist.  Unlike a binary heap, an rbtree lets us (a) do
peek-ahead in sorted order and (b) delete elements from an arbitrary
position without rebuilding anything.

If we adopt this approach, then I think a bunch of the problems we've
been talking about actually get a lot easier.  If we pull an item from
the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can
remove it from the other one cheaply, because we can store a pointer
to the RBTNode in the main object.  So then we never have any stale
pointers in any data structure, which means we don't have to have a
strategy to avoid accidentally following them.

The fact that we can peak-ahead correctly without any new code is also
very nice.  I'm still concerned that peeking ahead isn't the right
approach in general, but if we're going to do it, peeking ahead to the
actually-next-highest-priority item is a lot better than peeking ahead
to some-item-that-may-be-fairly-high-priority.

One problem which Andres spotted is that rbt_delete() can actually
move content around, so if you just cache the RBTNode returned by
rbt_insert(), it might not be the right one by the time you
rbt_delete(), if other stuff has been deleted first.  There are
several possible approaches to that problem, but one that I'm
wondering about is modifying rbt_delete_node() so that it doesn't rely
on rbt_copy_data.  The idea is that if y != z, instead of copying the
data from y to z, copy the left/right/parent pointers from z into y,
and make z's left, right, and parent nodes point to y instead.  Then
we always end up removing the correct node, which would make things
much easier for us and might well be helpful to other code that uses
rbtree as well.

Another small problem, also spotted by Andres, is that rbt_create()
uses palloc.  That seems easy to work around: just provide an
rbt_intialize() function that a caller can use instead of it wants to
initialize an already-allocated block of memory.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)
Next
From: Rémi Zara
Date:
Subject: Re: coypu: "FATAL: sorry, too many clients already"