
From: Antonin Houska
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: 92024.1605176256@antos
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote:
> > > Oops, right.  So it should just be added to the if condition.  Will do.
> >
> > It's been a couple of months and the discussion has gone stale.  It seems
> > also that the patch was waiting for an update.  So I am marking it as
> > RwF for now.  Please feel free to update it if you feel that's not
> > appropriate.
> 
> Thanks.  We decided to redesign a couple of aspects of the undo
> storage and record layers that this patch was intended to demonstrate,
> and work on that is underway.  More on that soon.

As my boss expressed in his recent blog post, we'd like to contribute to the
zheap development, and a couple of developers from other companies are
interested in this as well. Amit Kapila suggested that the "cleanup of
orphaned files" feature is a good starting point for getting the code into PG
core, so I've spent some time on it and tried to rebase the patch set.

In fact, what I did is not a mere rebase against the current master branch:
besides various bug fixes, I've also made some design changes.

Incorporated the new Undo Record Set (URS) infrastructure
---------------------------------------------------------

This is also pointed out in [0].

I started from [1], implemented some of the missing parts (e.g. proper
closing of the URSs after a crash), introduced an UNDO_DEBUG preprocessor
macro which makes the undo log segments very small, and fixed some bugs that
the small segments exposed.
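
For illustration, the debug trick amounts to something like the following.
The symbol name and sizes here are made up; the actual definitions in the
patch differ:

    /*
     * Hedged sketch only: shrink the undo log segments under UNDO_DEBUG so
     * that segment boundaries (and the bugs they can hide) are crossed
     * frequently even in small tests.  Names and sizes are illustrative.
     */
    #ifdef UNDO_DEBUG
    #define UndoLogSegmentSize (4 * BLCKSZ)     /* a few pages per segment */
    #else
    #define UndoLogSegmentSize (1024 * 1024)    /* regular segment size */
    #endif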

The most significant change I made was the removal of undo requests from the
checkpoint. I could not find any particular bug or race condition related to
including the requests in the checkpoint, but I concluded that it's easier to
reason about consistency and checkpoint timing if we scan the undo log on
restart (after recovery has finished) and create the requests from scratch.

[2] shows where I ended up before I started to rebase this patch set.

No background undo
------------------

Reducing the complexity of the patch seems to be the priority at the moment.
Amit suggested that cleanup of an orphaned relation file is simple enough to
be done in the foreground, and I agree.

"undo worker" is still there, but it only processes undo requests after server
restart because relation data can only be changed in a transaction - it seems
cleaner to launch a background worker for this than to hack the startup
process.

Since the concept of undo requests is closely related to the undo worker, I
removed undorequest.c too. The new (much simpler) undo worker gets the
information on incomplete / aborted transactions from the undo log, as
mentioned above.
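
To make the control flow concrete, here is a minimal sketch of the worker.
The scanning and apply helpers are hypothetical names for illustration; only
StartTransactionCommand() / CommitTransactionCommand() are the real
PostgreSQL calls, and the actual patch structures this differently:

    /*
     * Hedged sketch of the post-restart undo worker.  It rebuilds the
     * requests from scratch by scanning the undo log, instead of reading
     * them from checkpoint state.
     */
    static void
    UndoWorkerMain(void)
    {
        List       *incomplete = ScanUndoLogsForIncompleteXacts();
        ListCell   *lc;

        foreach(lc, incomplete)
        {
            UndoRecordSet *urs = (UndoRecordSet *) lfirst(lc);

            /*
             * Undo must run inside a transaction (it changes relation data),
             * which is why a background worker fits better than the startup
             * process.
             */
            StartTransactionCommand();
            ApplyPendingUndo(urs);      /* e.g. unlink the orphaned file */
            CommitTransactionCommand();
        }
    }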

SMGR enhancement
----------------

I used the 0001 patch from [3] rather than [4], although it's more invasive,
because I noticed somewhere in the discussion that there should be no reserved
database OID for the undo log. (InvalidOid cannot be used because it's already
in use for shared catalogs.)

Components added
----------------

A pg_undo_dump utility and a test framework for undoread.c. BTW, undoread.c
seems to need some refactoring.


The following are a few areas that are not implemented yet because more
discussion is needed:

Discarding
----------

There's no discard worker for the URS infrastructure yet. I thought about
discarding the undo log during checkpoint, but checkpoint should probably do
more straightforward tasks than calculating a new discard pointer for each
undo log, so a background worker is needed. A few notes on that:

  * Until the zheap AM gets added, only the transaction that creates the undo
    records needs to access them. This assumption should make the discarding
    algorithm a bit simpler. Note that with zheap, other transactions need to
    look up old versions of tuples, so the concept of an oldestXidHavingUndo
    variable is needed there.

  * It's rather simple to pass the URS pointer to the discard worker when the
    transaction has either committed or its undo has been executed. If the
    URS consists of a single chunk, the discard pointer can simply be
    advanced to the end of the chunk. But if there are multiple chunks, the
    discard worker might need to scan quite a lot of the undo log, because
    (IIUC) chunks of different URSs can be interleaved (if there's not enough
    space for a record in log 1, log 2 is used, but before we get to
    discarding, another transaction could have added its chunk to log 1) and
    because the chunks only contain backward links, not forward ones. If we
    added a forward link to the chunk header, it would make chunk closing
    more complex.

    How about storing the type header (which includes the XID) in each chunk
    instead of only the first chunk of the URS? That way we'd be able to
    check for each chunk separately whether it can be discarded; see the
    sketch after this list.

  * If the URS belongs to an aborted transaction, or to a transaction that
    could not finish due to a server crash, the transaction status alone does
    not justify discarding: we also need to be sure that the underlying undo
    records have been applied. So if we want to do without the
    oldestXidHavingUndo variable, some sort of undo progress tracking is
    needed; see below.
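
To make the per-chunk idea from the second point above a bit more concrete,
here is a minimal sketch of what the discard pass could look like if every
chunk carried the XID. Apart from TransactionIdDidCommit(), all types and
helpers below are hypothetical:

    /*
     * Hedged sketch: with the type header (and thus the XID) stored in every
     * chunk, the discard worker could advance the discard pointer chunk by
     * chunk, without following backward links across undo logs.  Everything
     * except TransactionIdDidCommit() is a made-up name.
     */
    static UndoRecPtr
    ComputeDiscardPointer(UndoLogState *log)
    {
        UndoRecPtr  discard = log->discard;
        UndoRecordSetChunkHeader hdr;

        while (ReadChunkHeader(log, discard, &hdr))
        {
            /*
             * The chunk can be discarded if its transaction committed, or if
             * it aborted / crashed and its undo has been fully applied.
             */
            if (!TransactionIdDidCommit(hdr.xid) && !UndoFullyApplied(&hdr))
                break;

            discard = hdr.chunk_end;    /* advance past this chunk */
        }
        return discard;
    }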

Do not execute the same undo record multiple times
--------------------------------------------------

Although I've noticed that the zheap code checks whether a particular undo
action has already been undone, I think this functionality fits better in the
URS layer. Also note that in [1] (i.e. the undo layer, no zheap) the header
comment of AtSubAbort_XactUndo() refers to this problem.

I've tried to implement such a thing (not included in this patch) by adding a
last_rec_applied field to UndoRecordSetChunkHeader. When the UNDO stage of
the transaction starts, this field is set to the last undo record of the
given chunk, and once that record is applied, the pointer moves to the
previous record in terms of undo pointer (i.e. the next record to be applied,
since the records are applied in reverse order), and so on. For recovery
purposes, the pointer is maintained in a similar way to the ud_insertion_point
field of UndoPageHeaderData. However, although I haven't tested performance
yet, I wonder if it's OK to lock the buffer containing the chunk header
exclusively for each undo record execution. I wonder if there's a better
place to store the progress information, maybe at page level?
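
In code terms, the update I'm worried about looks roughly like this. Only
the last_rec_applied field comes from my experiment; the helper's name and
signature are illustrative, while the buffer and page calls are the standard
PostgreSQL ones:

    /*
     * Hedged sketch of the progress update, called once per undo record
     * executed.  The per-record exclusive lock on the chunk header buffer is
     * the potential performance problem mentioned above.
     */
    static void
    UndoProgressAdvance(Buffer chunkbuf, UndoRecPtr next_to_apply)
    {
        Page        page = BufferGetPage(chunkbuf);
        UndoRecordSetChunkHeader *hdr;

        LockBuffer(chunkbuf, BUFFER_LOCK_EXCLUSIVE);
        hdr = (UndoRecordSetChunkHeader *) PageGetContents(page);
        hdr->last_rec_applied = next_to_apply;  /* moves backwards */
        MarkBufferDirty(chunkbuf);
        /* A WAL record would be needed here so that recovery can maintain
         * the pointer, similarly to ud_insertion_point. */
        LockBuffer(chunkbuf, BUFFER_LOCK_UNLOCK);
    }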

I can spend more time on this project, but I need a hint about which part I
should focus on. Other hackers might have the same question. Thanks for any
suggestions.

[0] https://www.postgresql.org/message-id/CA%2BTgmoZwkqXs3hpT_nd17fyMnZDkg8yU%3D5kG%2BHQw%2B80rumiwUA%40mail.gmail.com
[1] https://github.com/EnterpriseDB/zheap/tree/undo-record-set
[2] https://github.com/cybertec-postgresql/postgres/tree/undo-record-set-ah
[3] https://www.postgresql.org/message-id/CA%2BhUKGJfznxutTwpMLKPMjU_k9GhERoogyxx2Sf105LOA2La2A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

