Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: e8116a4f-de54-4d2b-3e25-4937364749e5@iki.fi
In response to: Re: POC: Cleaning up orphaned files using undo logs (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs
List: pgsql-hackers
On 05/08/2019 16:24, Robert Haas wrote:
> On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I feel that the level of abstraction is not quite right. There are a
>> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
>> probably useful for some UNDO resource managers (zheap I presume), but
>> seem kind of arbitrary. How is uur_tuple different from uur_payload?
>> Should they be named more generically as uur_payload1 and uur_payload2?
>> And why two, why not three or four different payloads? In the WAL record
>> format, there's a concept of "block id", which allows you to store N
>> number of different payloads in the record, I think that would be a
>> better approach. Or only have one payload, and let the resource manager
>> code divide it as it sees fit.
>>
>> Many of the fields support a primitive type of compression, where a
>> field can be omitted if it has the same value as on the first record on
>> an UNDO page. That's handy. But again I don't like the fact that the
>> fields have been hard-coded into the UNDO record format. I can see e.g.
>> the relation oid to be useful for many AMs. But not all. And other AMs
>> might well want to store and deduplicate other things, aside from the
>> fields that are in the patch now. I'd like to move most of the fields to
>> AM specific code, and somehow generalize the compression. One approach
>> would be to let the AM store an arbitrary struct, and run it through a
>> general-purpose compression algorithm, using the UNDO page's first
>> record as the "dictionary".
> 
> I thought about this, too. I agree that there's something a little
> unsatisfying about the current structure, but I haven't been able to
> come up with something that seems definitively better. I think
> something along the lines of what you are describing here might work
> well, but I am VERY doubtful about the idea of a fixed-size struct. I
> think AMs are going to want to store variable-length data: especially
> tuples, but maybe also other stuff. For instance, imagine some AM that
> wants to implement locking that's more fine-grained than the four
> levels of tuple locks we have today: instead of just having key locks
> and all-columns locks, you could want to store the exact columns to be
> locked. Or maybe your TIDs are variable-width.

Sure, a fixed-size struct is quite limiting. My point is that all that 
packing of data into UNDO records should be AM-specific. Maybe it would 
be handy to have a few common fields in the undo record header itself, 
but most data should be in the AM-specific payload, because it varies 
across AMs.
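
Just to illustrate what I have in mind (the names here are made up, not 
from the patch): the generic undo layer would interpret only a slim 
header, and everything else would be an opaque blob that the AM packs 
and unpacks itself:

typedef struct UndoRecordHeader
{
    TransactionId   urh_xid;            /* transaction that wrote the record */
    RmgrId          urh_rmid;           /* whose routines interpret the rest */
    uint16          urh_payload_len;    /* length of the AM-specific blob */
} UndoRecordHeader;

/* the AM-specific, variable-length payload follows the header */
#define UndoRecordPayload(rec) ((char *) (rec) + sizeof(UndoRecordHeader))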

> And the problem is that as soon as you move to something where you
> pack in a bunch of variable-sized fields, you lose the ability to
> refer to things using reasonable names.  That's where I came up with
> the idea of an UnpackedUndoRecord: give the common fields that
> "everyone's going to need" human-readable names, and jam only the
> strange, AM-specific stuff into the payload.  But if those needs are
> not actually universal but very much AM-specific, then I'm afraid
> we're going to end up with deeply inscrutable code for packing and
> unpacking records.  I imagine it's possible to come up with a good
> structure for that, but I don't think we have one today.

Yeah, that's also a problem with complicated WAL record types. Hopefully 
the complex cases are an exception, not the norm. A complex case is 
unlikely to fit any pre-defined set of fields anyway. (We could look at 
how e.g. protobuf works, if this is really a big problem. I'm not 
suggesting that we add a dependency just for this, but there might be 
some patterns or interfaces that we could mimic.)

If you remember, we did a big WAL format refactoring in 9.5, which moved 
some information from AM-specific structs to the common headers. Namely, 
the information on the relation blocks that the WAL record applies to. 
That was a very handy refactoring, and allowed tools like pg_waldump to 
print more detailed information about all WAL record types. For WAL 
records, moving the block information was natural, because there was 
special handling for full-page images anyway. However, I don't think we 
have enough experience with the UNDO log yet to know which fields would be 
best to include in the common undo header, and which to leave as 
AM-specific payload. I think we should keep the common header slim, and 
delegate to the AM routines.
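
By "delegate to the AM routines" I mean something like WAL's rmgr 
table: a small set of callbacks per undo handler. A sketch, again with 
made-up names:

typedef struct UndoAmRoutine
{
    /* apply one record when rolling back an aborted transaction */
    void        (*rm_undo) (UndoRecordHeader *hdr, void *payload);
    /* describe the payload, for a pg_waldump-style inspection tool */
    void        (*rm_desc) (StringInfo buf, UndoRecordHeader *hdr,
                            void *payload);
} UndoAmRoutine;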

For UNDO records, having an XID on every record probably makes sense; 
all the use cases for UNDO log we've discussed are transactional. The 
rules on which UNDO records to apply, and what to discard and when, 
depend on whether a transaction committed or aborted and when, so you 
need the XID for that. The rules also depend on the AM, though: for cleaning up 
orphaned files, an UNDO record for a committed transaction can be 
discarded immediately, while zheap and zedstore records need to be kept 
around longer. So the exact rules for that will need to be AM-specific, 
too. Or maybe there are only a few different cases and we can enumerate 
them, so that an AM can just set a flag on the UNDO record to indicate 
when it can be discarded, instead of having a callback or some other 
totally generic approach.
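
With the flag approach, the set of cases might be as small as this 
(hypothetical names, and the list is surely incomplete):

typedef enum UndoDiscardPolicy
{
    UNDO_DISCARD_ON_COMMIT,             /* e.g. orphaned-file cleanup */
    UNDO_DISCARD_ON_ROLLBACK_APPLIED,   /* once the undo has been executed */
    UNDO_DISCARD_WHEN_ALL_VISIBLE       /* e.g. zheap/zedstore: keep until
                                         * old snapshots are gone */
} UndoDiscardPolicy;

The discard worker could then decide what to throw away from the flag 
and the transaction's fate alone, without calling into the AM.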

In short, I think we should keep the common code that deals with UNDO 
records more dumb, and delegate to the AMs more. That's enough for 
cleaning up orphaned files; we don't really need the more complicated 
stuff for that. We probably need more smarts for zheap/zedstore, but we 
don't quite know what it should look like yet. Let's keep it simple for 
now, so that we can get something we can review and commit sooner, and 
we can build on top of that later.

>> I don't like the way UndoFetchRecord returns a palloc'd
>> UnpackedUndoRecord. I would prefer something similar to the xlogreader
>> API, where a new call to UndoFetchRecord invalidates the previous
>> result. On efficiency grounds, to avoid the palloc, but also to be
>> consistent with xlogreader.
> 
> I don't think that's going to work very well, because we often need to
> deal with multiple records at a time.  There is (or was) a bulk-fetch
> interface, but I've also found while experimenting with this code that
> it can be useful to do things like:
> 
> current = undo_fetch(starting_record);
> loop:
>      next = undo_fetch(current->next_record_ptr);
>      if some_test(next):
>          break;
>      undo_free(current);
>      current = next;
> 
> I think we shouldn't view such cases as exceptions to the general
> paradigm of looking at undo records one at a time, but instead as the
> normal case for which everything is optimized.  Cases like orphaned
> file cleanup where the number of undo records is probably small and
> they're all independent of each other will, I think, turn out to be
> the exception rather than the rule.

Hmm. If you're following an UNDO chain, from newest to oldest, I would 
assume that the newer record has enough information to decide whether 
you need to look at the previous record. If the previous record is no 
longer interesting, it might already have been discarded, after all.
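
With an xlogreader-style API, that kind of chain-following could look 
something like this (the reader API and the uur_prevundo field are made 
up for illustration; each fetch would overwrite the reader's buffer, so 
the previous record must not be touched afterwards):

UndoRecPtr      ptr = start;

while (UndoRecPtrIsValid(ptr))
{
    /* reuses the reader's buffer, invalidating the previous record */
    UnpackedUndoRecord *rec = UndoReaderFetch(reader, ptr);

    if (!still_interesting(rec))    /* decided from this record alone */
        break;
    apply_one_record(rec);
    ptr = rec->uur_prevundo;        /* on to the older record */
}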

I tried to browse through the zheap code but couldn't see that pattern. 
I'm not too familiar with the code, so I might've looked in the wrong 
place, though.

- Heikki


