Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CAFiTN-srKL3-WR9jq8HH-JX4SNn7MH3H9nx=aiC+f-msYe8mhw@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Hi Dilip,
>
> > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e
> > Author: Dilip Kumar <dilipkumar@localhost.localdomain>
> > Date:   Thu May 2 11:28:13 2019 +0530
> >
> >     Provide interfaces to store and fetch undo records.
>
> +#include "commands/tablecmds.h"
> +#include "storage/block.h"
> +#include "storage/buf.h"
> +#include "storage/buf_internals.h"
> +#include "storage/bufmgr.h"
> +#include "miscadmin.h"
>
> "miscadmin.h" comes before "storage...".

Right, fixed.

> +/*
> + * Compute the size of the partial record on the undo page.
> + *
> + * Compute the complete record size by uur_info and variable field length
> + * stored in the page header and then subtract the offset of the record so that
> + * we can get the exact size of partial record in this page.
> + */
> +static inline Size
> +UndoPagePartialRecSize(UndoPageHeader phdr)
> +{
> +    Size    size;
>
> We decided to use size_t everywhere in new code (except perhaps
> functions conforming to function pointer types that historically use
> Size in their type).
>
> +    /*
> +     * Compute the header size from undo record uur_info, stored in the page
> +     * header.
> +     */
> +    size = UndoRecordHeaderSize(phdr->uur_info);
> +
> +    /*
> +     * Add length of the variable part and undo length. Now, we know the
> +     * complete length of the undo record.
> +     */
> +    size += phdr->tuple_len + phdr->payload_len + sizeof(uint16);
> +
> +    /*
> +     * Subtract the size which is stored in the previous page to get the
> +     * partial record size stored in this page.
> +     */
> +    size -= phdr->record_offset;
> +
> +    return size;
>
> This is probably a stupid question but why isn't it enough to just
> store the offset of the first record that begins on this page, or 0
> for none yet?  Why do we need to worry about the partial record's
> payload etc?

Right, as this patch stands it would be enough to just store the offset
where the first complete record starts.
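As an aside, the size arithmetic in the quoted UndoPagePartialRecSize() can be sketched in a standalone way like this. DemoPageHeader and demo_partial_rec_size are made-up illustration names, not patch code; header_size stands in for what UndoRecordHeaderSize(phdr->uur_info) computes in the real function, and size_t is used per the review note above:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the relevant undo page header fields. */
typedef struct DemoPageHeader
{
    uint16_t    record_offset;  /* bytes of the partial record already
                                 * written on the previous page */
    uint16_t    tuple_len;      /* variable-part lengths copied into the
                                 * page header */
    uint16_t    payload_len;
    uint16_t    header_size;    /* derived from uur_info in the real code */
} DemoPageHeader;

/* Mirror of the quoted logic: total record size (header + variable
 * parts + 2-byte undo length) minus what the previous page holds. */
static size_t
demo_partial_rec_size(const DemoPageHeader *phdr)
{
    size_t      size = phdr->header_size;

    size += phdr->tuple_len + phdr->payload_len + sizeof(uint16_t);
    return size - phdr->record_offset;
}
```

For example, a record with a 24-byte header, 100-byte tuple, and 50-byte payload, of which 60 bytes landed on the previous page, leaves 24 + 100 + 50 + 2 - 60 = 116 bytes on this page.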
But for the undo page consistency checker we need to mask the CID field
in the partial record as well.  So we need to know how many bytes of the
partial record were already written on the previous page
(phdr->record_offset), which fields are present in the partial record
(uur_info), and the variable part, in order to compute the next record
offset.  Currently, I have improved it by storing the complete record
length instead of the payload and tuple lengths, but we can improve this
further by storing the next record offset directly, which will avoid
some computation.  I haven't worked much on the undo consistency patch
in this version, so I will analyze this further in the next version.

> +UndoRecPtr
> +PrepareUndoInsert(UndoRecordInsertContext *context,
> +                  UnpackedUndoRecord *urec,
> +                  Oid dbid)
> +{
> ...
> +    /* Fetch compression info for the transaction. */
> +    compression_info = GetTopTransactionUndoCompressionInfo(category);
>
> How can this work correctly in recovery?  [Edit: it doesn't, as you
> just pointed out]
>
> I had started reviewing an older version of your patch (the version
> that had made it as far as the undoprocessing branch as of recently),
> before I had the bright idea to look for a newer version.  I was going
> to object to the global variable you had there in the earlier version.
> It seems to me that you have to be able to reproduce the exact same
> compression in recovery that you produced at "do" time, no?  How can
> TopTransactionStateData be the right place for this in recovery?
>
> One data structure that could perhaps hold this would be
> UndoLogTableEntry (the per-backend cache, indexed by undo log number,
> with pretty fast lookups; used for things like
> UndoLogNumberGetCategory()).  As long as you never want to have
> inter-transaction compression, that should have the right scope to
> give recovery per-undo log tracking.  If you ever wanted to do
> compression between transactions too, maybe UndoLogSlot could work,
> but that'd have more complications.
Currently, I have read it from the first record on the page.

> +/*
> + * Read undo records of the transaction in bulk
> + *
> + * Read undo records between from_urecptr and to_urecptr until we exhaust the
> + * the memory size specified by undo_apply_size.  If we could not read all the
> + * records till to_urecptr then the caller should consume current set of records
> + * and call this function again.
> + *
> + * from_urecptr    - Where to start fetching the undo records.  If we can not
> + *                   read all the records because of memory limit then this
> + *                   will be set to the previous undo record pointer from where
> + *                   we need to start fetching on next call.  Otherwise it will
> + *                   be set to InvalidUndoRecPtr.
> + * to_urecptr      - Last undo record pointer to be fetched.
> + * undo_apply_size - Memory segment limit to collect undo records.
> + * nrecords        - Number of undo records read.
> + * one_page        - Caller is applying undo only for one block not for
> + *                   complete transaction.  If this is set true then instead
> + *                   of following transaction undo chain using prevlen we will
> + *                   follow the block prev chain of the block so that we can
> + *                   avoid reading many unnecessary undo records of the
> + *                   transaction.
> + */
> +UndoRecInfo *
> +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr,
> +                    int undo_apply_size, int *nrecords, bool one_page)
>
> Could you please make it clear in comments and assertions what the
> relation between from_urecptr and to_urecptr is and what they mean
> (they must be in the same undo log, one must be <= the other, both
> point to the *start* of a record, so it's not the same as the total
> range of undo)?

I have enhanced the comments for the same.

> undo_apply_size is not a good parameter name, because the function is
> useful for things other than applying records -- like the
> undoinspect() extension (or some better version of that), for example.
> Maybe max_result_size or something like that?

Changed.

> +{
> ...
> +    /* Allocate memory for next undo record. */
> +    uur = palloc0(sizeof(UnpackedUndoRecord));
> ...
> +
> +    size = UnpackedUndoRecordSize(uur);
> +    total_size += size;
>
> I see, so the unpacked records are still allocated one at a time.  I
> guess that's OK for now.  From some earlier discussion I had been
> expecting an arrangement where the actual records were laid out
> contiguously with their subcomponents (things they point to in
> palloc()'d memory) nearby.

In an earlier version I was allocating one single chunk of memory and
packing the records into it.  But there we needed to take care of the
alignment of each unpacked undo record so that we could access them
directly, so we have changed it to work this way.

> +static uint16
> +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> +                     UndoLogCategory category)
> +{
> ...
> +    char    prevlen[2];
> ...
> +    prev_rec_len = *(uint16 *) (prevlen);
>
> I don't think that's OK, and might crash on a non-Intel system.  How
> about using a union of uint16 and char[2]?

Changed.

> +    /* Copy undo record transaction header if it is present. */
> +    if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
> +        memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction);
>
> I was wondering why you don't use D = S instead of memcpy(&D, &S,
> size) wherever you can, until I noticed you use these SizeOfXXX macros
> that don't include trailing padding from structs, and that's also how
> you allocate objects.  Hmm.  So if I were to complain about you not
> using plain old assignment whenever you can, I'd also have to complain
> about that.

Fixed.

> I think that that technique of defining a SizeOfXXX macro that
> excludes trailing bytes makes sense for writing into WAL or undo log
> buffers using memcpy().  I'm not sure it makes sense for palloc() and
> copying into typed variables like you're doing here and I think I'd
> prefer the notational simplicity of using the (very humble) type
> system facilities C gives us.
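Two of the C points above can be illustrated in a standalone sketch: the unaligned prevlen read (with the union Thomas suggests, plus a memcpy() alternative), and a SizeOfXXX-style macro that excludes trailing struct padding. PrevLen, DemoHeader, and the read_prevlen_* helpers are made-up names for illustration, not patch code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Dereferencing a char[2] as *(uint16 *) is an unaligned, strict-
 * aliasing-unsafe access on some platforms.  A union makes the
 * reinterpretation explicit and keeps the store/load well-defined. */
typedef union
{
    uint16_t    value;
    char        bytes[2];
} PrevLen;

static uint16_t
read_prevlen_union(const char *src)
{
    PrevLen     u;

    u.bytes[0] = src[0];
    u.bytes[1] = src[1];
    return u.value;
}

/* memcpy() into a properly typed local is the other portable idiom. */
static uint16_t
read_prevlen_memcpy(const char *src)
{
    uint16_t    v;

    memcpy(&v, src, sizeof(v));
    return v;
}

/* A struct with internal alignment: the uint8 member is followed by
 * trailing padding, so sizeof() is larger than the offsetof-based
 * SizeOf macro, which counts only bytes up to the last member's end --
 * the quantity actually written into undo pages. */
typedef struct DemoHeader
{
    uint32_t    a;
    uint8_t     b;              /* trailing padding follows on typical ABIs */
} DemoHeader;

#define SizeOfDemoHeader (offsetof(DemoHeader, b) + sizeof(uint8_t))
```

Both readers return the same value for bytes written in native byte order, and SizeOfDemoHeader is 5 while sizeof(DemoHeader) is 8 on common 4-byte-aligned ABIs, which is exactly why memcpy(dst, src, SizeOfXXX) and plain struct assignment copy different numbers of bytes.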
> (Some memory checker might not like it if you palloc(the shorter size)
> and then use = if the compiler chooses to implement it as memcpy
> sizeof().)
>
> +/*
> + * The below common information will be stored in the first undo record of the
> + * page.  Every subsequent undo record will not store this information, if
> + * required this information will be retrieved from the first undo record of the
> + * page.
> + */
> +typedef struct UndoCompressionInfo
>
> Shouldn't this say "Every subsequent record will not store this
> information *if it's the same as the relevant fields in the first
> record*"?
>
> +#define UREC_INFO_TRANSACTION    0x001
> +#define UREC_INFO_RMID           0x002
> +#define UREC_INFO_RELOID         0x004
> +#define UREC_INFO_XID            0x008
>
> Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId?

Done.

> +/*
> + * Every undo record begins with an UndoRecordHeader structure, which is
> + * followed by the additional structures indicated by the contents of
> + * urec_info.  All structures are packed into the alignment without padding
> + * bytes, and the undo record itself need not be aligned either, so care
> + * must be taken when reading the header.
> + */
>
> I think you mean "All structures are packed into undo pages without
> considering alignment and without trailing padding bytes"?  This comes
> from the definition of the SizeOfXXX macros IIUC.  There might still
> be padding between members of some of those structs, no?  Like this
> one, that has the second member at offset 2 on my system:

Done.

> +typedef struct UndoRecordHeader
> +{
> +    uint8    urec_type;    /* record type code */
> +    uint16   urec_info;    /* flag bits */
> +} UndoRecordHeader;
> +
> +#define SizeOfUndoRecordHeader \
> +    (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16))
>
> +/*
> + * Information for a transaction to which this undo belongs.  This
> + * also stores the dbid and the progress of the undo apply during rollback.
> + */
> +typedef struct UndoRecordTransaction
> +{
> +    /*
> +     * Undo block number where we need to start reading the undo for applying
> +     * the undo action.  InvalidBlockNumber means undo applying hasn't
> +     * started for the transaction and MaxBlockNumber mean undo completely
> +     * applied.  And, any other block number means we have applied partial undo
> +     * so next we can start from this block.
> +     */
> +    BlockNumber urec_progress;
> +    Oid         urec_dbid;      /* database id */
> +    UndoRecPtr  urec_next;      /* urec pointer of the next transaction */
> +} UndoRecordTransaction;
>
> I propose that we rename this to UndoRecordGroupHeader (or something
> like that... maybe "Set", but we also use "set" as a verb in various
> relevant function names):

I have changed this.

> 1.  We'll also use these for the new "shared" records we recently
> invented that don't relate to a transaction.  This is really about
> defining the unit of discarding; we throw away the whole set of
> records at once, which is why it's basically about providing a space
> for "urec_next".
>
> 2.  Though it also holds rollback progress information, which is a
> transaction-specific concept, there can be more than one of these sets
> of records for a single transaction anyway.  A single transaction can
> write undo stuff in more than one undo log (different categories
> perm/temp/unlogged/shared and also due to log switching when they are
> full).
>
> So really it's just a header for an arbitrary set of records, used to
> track when and how to discard them.
>
> If you agree with that idea, perhaps urec_next should become something
> like urec_next_group, too.  "next" is a bit vague, especially for
> something as untyped as UndoRecPtr: someone might think it points to
> the next record.

Changed.

> More soon.

The latest patch is at
https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com