Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoZpyX3Jhhw9GeYpye24P9=h4sP+jzRns71wo_Vb4rViiQ@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: POC: Cleaning up orphaned files using undo logs
List: pgsql-hackers
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL
> record format, there's a concept of "block id", which allows you to
> store N number of different payloads in the record, I think that would
> be a better approach. Or only have one payload, and let the resource
> manager code divide it as it sees fit.
>
> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record
> on an UNDO page. That's handy. But again I don't like the fact that
> the fields have been hard-coded into the UNDO record format. I can see
> e.g. the relation oid to be useful for many AMs. But not all. And
> other AMs might well want to store and deduplicate other things, aside
> from the fields that are in the patch now. I'd like to move most of
> the fields to AM specific code, and somehow generalize the
> compression. One approach would be to let the AM store an arbitrary
> struct, and run it through a general-purpose compression algorithm,
> using the UNDO page's first record as the "dictionary".

I thought about this, too. I agree that there's something a little unsatisfying about the current structure, but I haven't been able to come up with something that seems definitively better. I think something along the lines of what you are describing here might work well, but I am VERY doubtful about the idea of a fixed-size struct. I think AMs are going to want to store variable-length data: especially tuples, but maybe also other stuff. For instance, imagine some AM that wants to implement locking that's more fine-grained than the four levels of tuple locks we have today: instead of just having key locks and all-columns locks, you could want to store the exact columns to be locked. Or maybe your TIDs are variable-width.

The problem is that as soon as you move to something where you pack in a bunch of variable-sized fields, you lose the ability to refer to things using reasonable names. That's where I came up with the idea of an UnpackedUndoRecord: give the common fields that "everyone's going to need" human-readable names, and jam only the strange, AM-specific stuff into the payload. But if those needs are not actually universal but very much AM-specific, then I'm afraid we're going to end up with deeply inscrutable code for packing and unpacking records. I imagine it's possible to come up with a good structure for that, but I don't think we have one today.
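To make the trade-off concrete, here is a rough sketch of the two shapes, purely hypothetical: the uur_* field names are the ones mentioned above, but the type names, layout, and the opaque variant are invented for illustration and are not what the patch actually defines.

    /* Hypothetical sketch only -- not the structures from the patch. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;
    typedef uint32_t Oid;

    /*
     * Shape 1: common fields get human-readable names; anything strange
     * and AM-specific goes into an opaque payload.
     */
    typedef struct UnpackedUndoRecordSketch
    {
        Oid          uur_reloid;       /* relation oid: useful to many AMs, not all */
        BlockNumber  uur_block;        /* block/offset make sense for heap-like AMs */
        OffsetNumber uur_offset;
        size_t       uur_tuple_len;    /* e.g. a prior tuple version */
        char        *uur_tuple;
        size_t       uur_payload_len;  /* whatever else the AM needs */
        char        *uur_payload;
    } UnpackedUndoRecordSketch;

    /*
     * Shape 2: the record format knows nothing about the contents; each
     * undo resource manager packs and unpacks a single variable-length
     * blob, and any "same as the first record on this undo page"
     * compression has to be done generically.
     */
    typedef struct OpaqueUndoRecordSketch
    {
        uint8_t  urs_rmid;    /* owning undo resource manager */
        size_t   urs_len;     /* length of AM-defined data */
        char     urs_data[];  /* AM-defined contents (flexible array member) */
    } OpaqueUndoRecordSketch;

With shape 2, code like record->uur_block disappears: the AM has to copy fields in and out of urs_data itself, which is exactly the packing/unpacking inscrutability I'm worried about.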
> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.

I don't think that's going to work very well, because we often need to deal with multiple records at a time. There is (or was) a bulk-fetch interface, but I've also found while experimenting with this code that it can be useful to do things like:

    current = undo_fetch(starting_record);
    loop:
        next = undo_fetch(current->next_record_ptr);
        if some_test(next):
            break;
        undo_free(current);
        current = next;

I think we shouldn't view such cases as exceptions to the general paradigm of looking at undo records one at a time, but instead as the normal case for which everything is optimized. Cases like orphaned file cleanup, where the number of undo records is probably small and they're all independent of each other, will, I think, turn out to be the exception rather than the rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company