Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From	Robert Haas
Subject	Re: POC: Cleaning up orphaned files using undo logs
Date	August 5, 2019 16:24:50
Msg-id	CA+TgmoZpyX3Jhhw9GeYpye24P9=h4sP+jzRns71wo_Vb4rViiQ@mail.gmail.com
In response to	Re: POC: Cleaning up orphaned files using undo logs (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses	Re: POC: Cleaning up orphaned files using undo logs
List	pgsql-hackers

On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL record
> format, there's a concept of "block id", which allows you to store N
> number of different payloads in the record, I think that would be a
> better approach. Or only have one payload, and let the resource manager
> code divide it as it sees fit.
>
> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record on
> an UNDO page. That's handy. But again I don't like the fact that the
> fields have been hard-coded into the UNDO record format. I can see e.g.
> the relation oid to be useful for many AMs. But not all. And other AMs
> might well want to store and deduplicate other things, aside from the
> fields that are in the patch now. I'd like to move most of the fields to
> AM specific code, and somehow generalize the compression. One approach
> would be to let the AM store an arbitrary struct, and run it through a
> general-purpose compression algorithm, using the UNDO page's first
> record as the "dictionary".

I thought about this, too. I agree that there's something a little
unsatisfying about the current structure, but I haven't been able to
come up with something that seems definitively better. I think
something along the lines of what you are describing here might work
well, but I am VERY doubtful about the idea of a fixed-size struct. I
think AMs are going to want to store variable-length data: especially
tuples, but maybe also other stuff. For instance, imagine some AM that
wants to implement locking that's more fine-grained than the four
levels of tuple locks we have today: instead of just having key locks
and all-columns locks, you could want to store the exact columns to be
locked. Or maybe your TIDs are variable-width.

And the problem is that as soon as you move to something where you
pack in a bunch of variable-sized fields, you lose the ability to
refer to things using reasonable names.  That's where I came up with
the idea of an UnpackedUndoRecord: give the common fields that
"everyone's going to need" human-readable names, and jam only the
strange, AM-specific stuff into the payload.  But if those needs are
not actually universal but very much AM-specific, then I'm afraid
we're going to end up with deeply inscrutable code for packing and
unpacking records.  I imagine it's possible to come up with a good
structure for that, but I don't think we have one today.
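To make the trade-off concrete, here is a minimal sketch of the "named common fields plus opaque payload" shape. It is not the patch's UnpackedUndoRecord: the type DemoUnpackedUndoRecord, its field choices, and the functions demo_pack and demo_unpack are all hypothetical, and the byte layout is native-endian and only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical unpacked record: a few fields assumed to be "common" get
 * names, and everything AM-specific goes into one variable-length payload.
 */
typedef struct DemoUnpackedUndoRecord
{
    uint32_t       reloid;       /* assumed common field */
    uint32_t       block;        /* assumed common field */
    uint16_t       offset;       /* assumed common field */
    uint32_t       payload_len;  /* number of AM-specific bytes */
    const uint8_t *payload;      /* AM-specific bytes, variable length */
} DemoUnpackedUndoRecord;

/* Packed header: reloid + block + offset + payload_len, with no padding. */
#define DEMO_HEADER_SIZE (4 + 4 + 2 + 4)

/* Serialize rec into buf. Returns bytes written, or 0 if buf is too small. */
size_t
demo_pack(const DemoUnpackedUndoRecord *rec, uint8_t *buf, size_t buflen)
{
    size_t   need;
    uint8_t *p = buf;

    assert(rec != NULL && buf != NULL);
    need = DEMO_HEADER_SIZE + (size_t) rec->payload_len;
    if (buflen < need)
        return 0;

    memcpy(p, &rec->reloid, 4);      p += 4;
    memcpy(p, &rec->block, 4);       p += 4;
    memcpy(p, &rec->offset, 2);      p += 2;
    memcpy(p, &rec->payload_len, 4); p += 4;
    if (rec->payload_len > 0)
        memcpy(p, rec->payload, rec->payload_len);
    return need;
}

/*
 * Deserialize from buf. On success rec->payload points into buf, so buf must
 * outlive rec. Returns bytes consumed, or 0 if the input is truncated.
 */
size_t
demo_unpack(const uint8_t *buf, size_t buflen, DemoUnpackedUndoRecord *rec)
{
    const uint8_t *p = buf;

    assert(rec != NULL && buf != NULL);
    if (buflen < DEMO_HEADER_SIZE)
        return 0;

    memcpy(&rec->reloid, p, 4);      p += 4;
    memcpy(&rec->block, p, 4);       p += 4;
    memcpy(&rec->offset, p, 2);      p += 2;
    memcpy(&rec->payload_len, p, 4); p += 4;
    if (buflen - DEMO_HEADER_SIZE < rec->payload_len)
        return 0;

    rec->payload = p;
    return DEMO_HEADER_SIZE + (size_t) rec->payload_len;
}
```

The named fields stay readable at the call site, but anything that only one AM needs (a tuple, a column set, a variable-width TID) still has to be hand-packed into the payload bytes, which is where the inscrutable pack/unpack code would accumulate.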

> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.

I don't think that's going to work very well, because we often need to
deal with multiple records at a time.  There is (or was) a bulk-fetch
interface, but I've also found while experimenting with this code that
it can be useful to do things like:

current = undo_fetch(starting_record);
for (;;)
{
    /* fetch the next record while 'current' is still valid */
    next = undo_fetch(current->next_record_ptr);
    if (some_test(next))
        break;
    undo_free(current);
    current = next;
}

I think we shouldn't view such cases as exceptions to the general
paradigm of looking at undo records one at a time, but instead as the
normal case for which everything is optimized.  Cases like orphaned
file cleanup where the number of undo records is probably small and
they're all independent of each other will, I think, turn out to be
the exception rather than the rule.
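The fetch-next-before-freeing-current loop above can be sketched as compilable C. Everything here is a stand-in: the in-memory array demo_log, the DemoUndoRecord type, demo_undo_fetch, demo_undo_free, and the "value >= limit" test are hypothetical and only model the access pattern, not the real undo API. Each fetch returns its own allocation, so two records can be live at once. With a single reused result buffer, fetching next would overwrite current before the test runs.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_INVALID_PTR (-1)

/* Hypothetical undo record: some contents plus a link to the next record. */
typedef struct DemoUndoRecord
{
    int value;            /* stand-in for the record contents */
    int next_record_ptr;  /* index of the next record, or DEMO_INVALID_PTR */
} DemoUndoRecord;

/* Hypothetical undo log: a chain 10 -> 20 -> 30 -> 40. */
static const DemoUndoRecord demo_log[] = {
    {10, 1}, {20, 2}, {30, 3}, {40, DEMO_INVALID_PTR}
};

/* Number of fetched records that have not been freed yet. */
int demo_live_records = 0;

/* Return a freshly allocated copy of the record at ptr, or NULL if invalid. */
DemoUndoRecord *
demo_undo_fetch(int ptr)
{
    DemoUndoRecord *copy;

    if (ptr < 0 || ptr >= (int) (sizeof(demo_log) / sizeof(demo_log[0])))
        return NULL;
    copy = malloc(sizeof(*copy));
    assert(copy != NULL);
    *copy = demo_log[ptr];
    demo_live_records++;
    return copy;
}

/* Release a record returned by demo_undo_fetch; NULL is ignored. */
void
demo_undo_free(DemoUndoRecord *rec)
{
    if (rec != NULL)
    {
        free(rec);
        demo_live_records--;
    }
}

/*
 * Walk the chain from start and return the value of the last record reached
 * before one whose value is >= limit (or before the end of the chain).
 * Returns -1 if start is not a valid record pointer.
 */
int
demo_last_value_below(int start, int limit)
{
    DemoUndoRecord *current = demo_undo_fetch(start);
    int             result;

    if (current == NULL)
        return -1;

    for (;;)
    {
        /* 'current' and 'next' are both live while the test runs. */
        DemoUndoRecord *next = demo_undo_fetch(current->next_record_ptr);

        if (next == NULL || next->value >= limit)
        {
            demo_undo_free(next);
            break;
        }
        demo_undo_free(current);
        current = next;
    }

    result = current->value;
    demo_undo_free(current);
    return result;
}
```

What happens after the break is an assumption of this sketch. The point is only that the caller decides when each record is released, which a "new fetch invalidates the previous result" interface cannot express.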

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
