Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+hUKGJz5xtLrcZE418_Xat8o7T1Wy6D5bK8ZMX1QrTqQbJT-w@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > > really need that? I think it's never modified (so it could just be > > > > declared const), > > > > > > Actually, this will get modified otherwise across undo record > > > insertion how we will know what was the values of the common fields in > > > the first record of the page. Another option could be that every time > > > we insert the record, read the value from the first complete undo > > > record on the page but that will be costly because for every new > > > insertion we need to read the first undo record of the page. > > > > > > > This information won't be shared across transactions, so can't we keep > > it in top transaction's state? It seems to me that will be better > > than to maintain it as a global state. > > I think this idea is good for the DO time but during REDO time it will > not work as we will not have the transaction state. Having said that > the current idea of keeping in the global variable will also not work > during REDO time because the WAL from the different transaction can be > interleaved. There are few ideas to handle this issue > > 1. At DO time keep in TopTransactionState as you suggested and during > recovery time read from the first complete record on the page. > 2. Just to keep the code uniform always read from the first complete > record of the page. > > After putting more though I am more inclined towards idea-2. Because > we are anyway inserting our current record into that page basically we > have read the buffer and also holds the exclusive lock on the buffer. > So reading a few extra bytes from the buffer will not hurt us IMHO. > > If someone has a better solution please suggest. Hi Dilip, Here's some initial review of the following patch (from your public undo_interface_v1 branch as of this morning). I haven't tested this version yet, because my means of testing this stuff involves waiting for undoprocessing to be rebased, so that I can test it with my orphaned files stuff and other test programs. It contains another suggestion for that problem you just mentioned (and also me pointing out what you just pointed out, since I wrote it earlier) though I'm not sure if it's better than your options above. > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e > Author: Dilip Kumar <dilipkumar@localhost.localdomain> > Date: Thu May 2 11:28:13 2019 +0530 > > Provide interfaces to store and fetch undo records. +#include "commands/tablecmds.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "miscadmin.h" "miscadmin.h" comes before "storage...". +/* + * Compute the size of the partial record on the undo page. + * + * Compute the complete record size by uur_info and variable field length + * stored in the page header and then subtract the offset of the record so that + * we can get the exact size of partial record in this page. + */ +static inline Size +UndoPagePartialRecSize(UndoPageHeader phdr) +{ + Size size; We decided to use size_t everywhere in new code (except perhaps functions conforming to function pointer types that historically use Size in their type). + /* + * Compute the header size from undo record uur_info, stored in the page + * header. + */ + size = UndoRecordHeaderSize(phdr->uur_info); + + /* + * Add length of the variable part and undo length. Now, we know the + * complete length of the undo record. + */ + size += phdr->tuple_len + phdr->payload_len + sizeof(uint16); + + /* + * Subtract the size which is stored in the previous page to get the + * partial record size stored in this page. + */ + size -= phdr->record_offset; + + return size; This is probably a stupid question but why isn't it enough to just store the offset of the first record that begins on this page, or 0 for none yet? Why do we need to worry about the partial record's payload etc? +UndoRecPtr +PrepareUndoInsert(UndoRecordInsertContext *context, + UnpackedUndoRecord *urec, + Oid dbid) +{ ... + /* Fetch compression info for the transaction. */ + compression_info = GetTopTransactionUndoCompressionInfo(category); How can this work correctly in recovery? [Edit: it doesn't, as you just pointed out] I had started reviewing an older version of your patch (the version that had made it as far as the undoprocessing branch as of recently), before I had the bright idea to look for a newer version. I was going to object to the global variable you had there in the earlier version. It seems to me that you have to be able to reproduce the exact same compression in recovery that you produced as "do" time, no? How can TopTranasctionStateData be the right place for this in recovery? One data structure that could perhaps hold this would be UndoLogTableEntry (the per-backend cache, indexed by undo log number, with pretty fast lookups; used for things like UndoLogNumberGetCategory()). As long as you never want to have inter-transaction compression, that should have the right scope to give recovery per-undo log tracking. If you ever wanted to do compression between transactions too, maybe UndoLogSlot could work, but that'd have more complications. +/* + * Read undo records of the transaction in bulk + * + * Read undo records between from_urecptr and to_urecptr until we exhaust the + * the memory size specified by undo_apply_size. If we could not read all the + * records till to_urecptr then the caller should consume current set of records + * and call this function again. + * + * from_urecptr - Where to start fetching the undo records. If we can not + * read all the records because of memory limit then this + * will be set to the previous undo record pointer from where + * we need to start fetching on next call. Otherwise it will + * be set to InvalidUndoRecPtr. + * to_urecptr - Last undo record pointer to be fetched. + * undo_apply_size - Memory segment limit to collect undo records. + * nrecords - Number of undo records read. + * one_page - Caller is applying undo only for one block not for + * complete transaction. If this is set true then instead + * of following transaction undo chain using prevlen we will + * follow the block prev chain of the block so that we can + * avoid reading many unnecessary undo records of the + * transaction. + */ +UndoRecInfo * +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, + int undo_apply_size, int *nrecords, bool one_page) Could you please make it clear in comments and assertions what the relation between from_urecptr and to_urecptr is and what they mean (they must be in the same undo log, one must be <= the other, both point to the *start* of a record, so it's not the same as the total range of undo)? undo_apply_size is not a good parameter name, because the function is useful for things other than applying records -- like the undoinspect() extension (or some better version of that), for example. Maybe max_result_size or something like that? +{ ... + /* Allocate memory for next undo record. */ + uur = palloc0(sizeof(UnpackedUndoRecord)); ... + + size = UnpackedUndoRecordSize(uur); + total_size += size; I see, so the unpacked records are still allocated one at a time. I guess that's OK for now. From some earlier discussion I had been expecting an arrangement where the actual records were laid out contiguously with their subcomponents (things they point to in palloc()'d memory) nearby. +static uint16 +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, + UndoLogCategory category) +{ ... + char prevlen[2]; ... + prev_rec_len = *(uint16 *) (prevlen); I don't think that's OK, and might crash on a non-Intel system. How about using a union of uint16 and char[2]? + /* Copy undo record transaction header if it is present. */ + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); I was wondering why you don't use D = S instead of mempcy(&D, &S, size) wherever you can, until I noticed you use these SizeOfXXX macros that don't include trailing padding from structs, and that's also how you allocate objects. Hmm. So if I were to complain about you not using plain old assignment whenever you can, I'd also have to complain about that. I think that that technique of defining a SizeOfXXX macro that excludes trailing bytes makes sense for writing into WAL or undo log buffers using mempcy(). I'm not sure it makes sense for palloc() and copying into typed variables like you're doing here and I think I'd prefer the notational simplicity of using the (very humble) type system facilities C gives us. (Some memory checker might not like it you palloc(the shorter size) and then use = if the compiler chooses to implement it as memcpy sizeof().) +/* + * The below common information will be stored in the first undo record of the + * page. Every subsequent undo record will not store this information, if + * required this information will be retrieved from the first undo record of the + * page. + */ +typedef struct UndoCompressionInfo Shouldn't this say "Every subsequent record will not store this information *if it's the same as the relevant fields in the first record*"? +#define UREC_INFO_TRANSACTION 0x001 +#define UREC_INFO_RMID 0x002 +#define UREC_INFO_RELOID 0x004 +#define UREC_INFO_XID 0x008 Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId? +/* + * Every undo record begins with an UndoRecordHeader structure, which is + * followed by the additional structures indicated by the contents of + * urec_info. All structures are packed into the alignment without padding + * bytes, and the undo record itself need not be aligned either, so care + * must be taken when reading the header. + */ I think you mean "All structures are packed into undo pages without considering alignment and without trailing padding bytes"? This comes from the definition of the SizeOfXXX macros IIUC. There might still be padding between members of some of those structs, no? Like this one, that has the second member at offset 2 on my system: +typedef struct UndoRecordHeader +{ + uint8 urec_type; /* record type code */ + uint16 urec_info; /* flag bits */ +} UndoRecordHeader; + +#define SizeOfUndoRecordHeader \ + (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16)) +/* + * Information for a transaction to which this undo belongs. This + * also stores the dbid and the progress of the undo apply during rollback. + */ +typedef struct UndoRecordTransaction +{ + /* + * Undo block number where we need to start reading the undo for applying + * the undo action. InvalidBlockNumber means undo applying hasn't + * started for the transaction and MaxBlockNumber mean undo completely + * applied. And, any other block number means we have applied partial undo + * so next we can start from this block. + */ + BlockNumber urec_progress; + Oid urec_dbid; /* database id */ + UndoRecPtr urec_next; /* urec pointer of the next transaction */ +} UndoRecordTransaction; I propose that we rename this to UndoRecordGroupHeader (or something like that... maybe "Set", but we also use "set" as a verb in various relevant function names): 1. We'll also use these for the new "shared" records we recently invented that don't relate to a transaction. This is really about defining the unit of discarding; we throw away the whole set of records at once, which is why it's basically about proividing a space for "urec_next". 2. Though it also holds rollback progress information, which is a transaction-specific concept, there can be more than one of these sets of records for a single transaction anyway. A single transaction can write undo stuff in more than one undo log (different categories perm/temp/unlogged/shared and also due to log switching when they are full). So really it's just a header for an arbitrary set of records, used to track when and how to discard them. If you agree with that idea, perhaps urec_next should become something like urec_next_group, too. "next" is a bit vague, especially for something as untyped as UndoRecPtr: someone might think it points to the next record. More soon. -- Thomas Munro https://enterprisedb.com
pgsql-hackers by date: