Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CAFiTN-v54rTaizf+Lq9-9sA3C=ROC1LPH8gxEwP1z5eEZSRJ_g@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
|
List | pgsql-hackers |
On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > (as I was out of context due to dealing with bugs, I've switched to > lookign at the current zheap/undoprocessing branch. > > On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > > +/* > > + * Insert a previously-prepared undo records. > > + * > > + * This function will write the actual undo record into the buffers which are > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > > + * step should be performed inside a critical section. > > + */ > > Again, I think it's not ok to just assume you can lock an essentially > unbounded number of buffers. This seems almost guaranteed to result in > deadlocks. And there's limits on how many lwlocks one can hold etc. I think for controlling that we need to put a limit on max prepared undo? I am not sure any other way of limiting the number of buffers because we must lock all the buffer in which we are going to insert the undo record under one WAL logged operation. > > As far as I can tell there's simply no deadlock avoidance scheme in use > here *at all*? I must be missing something. We are always locking buffer in block order so I am not sure how it can deadlock? Am I missing something? > > > > + /* Main loop for writing the undo record. */ > > + do > > + { > > I'd prefer this to not be a do{} while(true) loop - as written I need to > read to the end to see what the condition is. I don't think we have any > loops like that in the code. Right, changed > > > > + /* > > + * During recovery, there might be some blocks which are already > > + * deleted due to some discard command so we can just skip > > + * inserting into those blocks. > > + */ > > + if (!BufferIsValid(buffer)) > > + { > > + Assert(InRecovery); > > + > > + /* > > + * Skip actual writing just update the context so that we have > > + * write offset for inserting into next blocks. > > + */ > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + break; > > + } > > How exactly can this happen? Suppose you insert one record for the transaction which split in block1 and 2. Now, before this block is actually going to the disk the transaction committed and become all visible the undo logs are discarded. It's possible that block 1 is completely discarded but block 2 is not because it might have undo for the next transaction. Now, during recovery (FPW is off) if block 1 is missing but block 2 is their so we need to skip inserting undo for block 1 as it does not exist. > > > > + else > > + { > > + page = BufferGetPage(buffer); > > + > > + /* > > + * Initialize the page whenever we try to write the first > > + * record in page. We start writing immediately after the > > + * block header. > > + */ > > + if (starting_byte == UndoLogBlockHeaderSize) > > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > > + ucontext.already_processed, > > + prepared_undo->urec->uur_tuple.len, > > + prepared_undo->urec->uur_payload.len); > > + > > + /* > > + * Try to insert the record into the current page. If it > > + * doesn't succeed then recall the routine with the next page. > > + */ > > + InsertUndoData(&ucontext, page, starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + MarkBufferDirty(buffer); > > + break; > > At this point we're five indentation levels deep. I'd extract at least > either the the per prepared undo code or the code performing the writing > across block boundaries into a separate function. Perhaps both. I have moved it to the separate function. > > > > > +/* > > + * Helper function for UndoGetOneRecord > > + * > > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > > + * it will get the information from the first complete undo record in the > > + * page. > > + */ > > +static void > > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > > +{ > > + /* > > + * If any of the common header field is not available in the current undo > > + * record then we must read it from the first complete record of the page. > > + */ > > How is it guaranteed that the first record on the page is actually from > the current transaction? Can't there be a situation where that's from > another transaction? If the first record is not from the same transaction then the record must have all those fields in it so it should never try to access the first record. I have updated the comments for the same. > > > > > +/* > > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > > + * > > + * curbuf - If an input buffer is valid then this function will not release the > > + * pin on that buffer. If the buffer is not valid then it will assign curbuf > > + * with the first buffer of the current undo record and also it will keep the > > + * pin and lock on that buffer in a hope that while traversing the undo chain > > + * the caller might want to read the previous undo record from the same block. > > + */ > > Wait, so at exit *curbuf is pinned but not locked, if passed in, but is > pinned *and* locked when not? That'd not be a sane API. I don't think > the code works like that atm though. Comments were wrong, I have fixed. > > > > +static UnpackedUndoRecord * > > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > > + UndoLogCategory category, Buffer *curbuf) > > +{ > > + Page page; > > + int starting_byte = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk; > > + UndoPackContext ucontext = {{0}}; > > + Buffer buffer = *curbuf; > > + > > + cur_blk = UndoRecPtrGetBlockNum(urp); > > + > > + /* Initiate unpacking one undo record. */ > > + BeginUnpackUndo(&ucontext); > > + > > + while (true) > > + { > > + /* If we already have a buffer then no need to allocate a new one. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > > + RBM_NORMAL, NULL, > > + RelPersistenceForUndoLogCategory(category)); > > + > > + /* > > + * Remember the first buffer where this undo started as next undo > > + * record what we fetch might fall on the same buffer. > > + */ > > + if (!BufferIsValid(*curbuf)) > > + *curbuf = buffer; > > + } > > + > > + /* Acquire shared lock on the buffer before reading undo from it. */ > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + > > + UnpackUndoData(&ucontext, page, starting_byte); > > + > > + /* > > + * We are done if we have reached to the done stage otherwise move to > > + * next block and continue reading from there. > > + */ > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Get any of the missing fields from the first record of the > > + * page. > > + */ > > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > > + break; > > + } > > + > > + /* > > + * The record spans more than a page so we would have copied it (see > > + * UnpackUndoRecord). In such cases, we can release the buffer. > > + */ > > Where would it have been copied? Presumably in UnpackUndoData()? Imo the > comment should say so. > > I'm a bit confused by the use of "would" in that comment. Either we > have, or not? This comment is obsolete so removed. > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > Wait, so we *keep* the buffer locked if it the same as *curbuf? That > can't be right. At the end we are releasing the lock on the *curbuf. But now I have changed it so that it is more readable. > > > > + * Fetch the undo record for given undo record pointer. > > + * > > + * This will internally allocate the memory for the unpacked undo record which > > + * intern will > > "intern" should probably be internally? But I'm not sure what the two > "internally"s really add here. > > > > > +/* > > + * Release the memory of the undo record allocated by UndoFetchRecord and > > + * UndoBulkFetchRecord. > > + */ > > +void > > +UndoRecordRelease(UnpackedUndoRecord *urec) > > +{ > > + /* Release the memory of payload data if we allocated it. */ > > + if (urec->uur_payload.data) > > + pfree(urec->uur_payload.data); > > + > > + /* Release memory of tuple data if we allocated it. */ > > + if (urec->uur_tuple.data) > > + pfree(urec->uur_tuple.data); > > + > > + /* Release memory of the transaction header if we allocated it. */ > > + if (urec->uur_txn) > > + pfree(urec->uur_txn); > > + > > + /* Release memory of the logswitch header if we allocated it. */ > > + if (urec->uur_logswitch) > > + pfree(urec->uur_logswitch); > > + > > + /* Release the memory of the undo record. */ > > + pfree(urec); > > +} > > Those comments before each pfree are not useful. Removed > > > Also, isn't this both fairly slow and fairly failure prone? The next > record is going to need all that memory again, no? It seems to me that > there should be one record that's allocated once, and then reused over > multiple fetches, increasing the size if necesssary. > > I'm very doubtful that all this freeing of individual allocations in the > undo code makes sense. Shouldn't this just be done in short lived memory > contexts, that then get reset as a whole? That's both far less failure > prone, and faster. > > > > + * one_page - Caller is applying undo only for one block not for > > + * complete transaction. If this is set true then instead > > + * of following transaction undo chain using prevlen we will > > + * follow the block prev chain of the block so that we can > > + * avoid reading many unnecessary undo records of the > > + * transaction. > > + */ > > +UndoRecInfo * > > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > > + int undo_apply_size, int *nrecords, bool one_page) > > > There's no caller for one_page mode in the series - I assume that's for > later, during page-wise undo? It seems to behave in quite noticably > different ways, is that really OK? Makes the code quite hard to > understand. > Also, it seems quite poorly named to me. It sounds like it's about > fetching a single undo page (which makes no sense, obviously). But what > it does is to switch to an entirely different way of traversing the undo > chains. one_page was zheap specific so I have removed it. I think in zheap specific function we can implement it by UndoFetchRecord in a loop. > > > > > + /* > > + * In one_page mode we are fetching undo only for one page instead of > > + * fetching all the undo of the transaction. Basically, we are fetching > > + * interleaved undo records. So it does not make sense to do any prefetch > > + * in that case. > > What does "interleaved" mean here? I meant that for one page we are following blockprev chain instead of complete transaction undo chain so there is no guarantee that the undo records are together. Basically, the undo records for the different blocks can be interleaved so I am not sure should we prefetch or not. I assume that there will often be > other UNDO records interspersed? But that's not guaranteed at all, > right? In fact, for a lot of workloads it seems likely that there will > be many consecutive undo records for a single page? In fact, won't that > be the majority of cases? Ok, that point makes sense to me but I thought if we always assume this we will do unwanted prefetch where this is not the case and we will put unnecessary load on the I/O. Currently, I have moved that code out of the undo layer so we can take a call while designing zheap specific function. > > Thus it's not obvious to me that there's not often going to be > consecutive pages for this case too. I'd even say that minimizing IO > delay is *MORE* important during page-wise undo, as that happens in the > context of client accesses, and it's not incurring cost on the party > that performed DML, but on some random third party. > > > I'm doubtful this is a sane interface. There's a lot of duplication > between one_page and not one_page. It presupposes specific ways of > constructing chains that are likely to depend on the AM. to_urecptr is > only used in certain situations. E.g. I strongly suspect that for > zheap's visibility determinations we'd want to concurrently follow all > the necessary chains to determine visibility for all all tuples on the > page, far enough to find the visible tuple - for seqscan's / bitmap heap > scans / everything using page mode scans, that'll be way more efficient > than doing this one-by-one and possibly even repeatedly. But what is > exactly the right thing to do is going to be highly AM specific. > > I vaguely suspect what you'd want is an interface where the "bulk fetch" > context basically has a FIFO queue of undo records to fetch, and a > function to actually perform fetching. Whenever a record has been > retrieved, a callback determines whether additional records are needed. > In the case of fetching all the undo for a transaction, you'd just queue > - probably in a more efficient representation - all the necessary > undo. In case of page-wise undo, you'd queue the first record of the > chain you'd want to undo, with a callback for queuing the next > record. For visibility determinations in zheap, you'd queue all the > different necessary chains, with a callback that queues the next > necessary record if still needed for visibility determination. > > And then I suspect you'd have a separate callback whenever records have > been fetched, with all the 'unconsumed' records. That then can, > e.g. based on memory consumption, decide to process them or not. For > visibility information you'd probably just want to condense the records > to the minimum necessary (i.e. visibility information for the relevant > tuples, and the visibile tuple when encountered) as soon as available. I haven't think on this part yet. I will analyze part. > > Obviously that's pretty handwavy. > > > > > > Also, if we are fetching undo records from more than one > > + * log, we don't know the boundaries for prefetching. Hence, we can't use > > + * prefetching in this case. > > + */ > > Hm. Why don't we know the boundaries (or cheaply infer them)? I have added comments for that. Basically, when we get the undo records from the different log (from and to pointers are in the different log) we don't know in latest undo log till what point the undo are from this transaction. We may consider prefetching to the start of the current log but there is no guarantee that all the blocks of the current logs are valid and not yet discarded. Ideally, the better fix would be that the caller always pass the from and to pointer from the same undo log. > > > > + /* > > + * If prefetch_pages are half of the prefetch_target then it's time to > > + * prefetch again. > > + */ > > + if (prefetch_pages < prefetch_target / 2) > > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > > + from_blkno, category); > > Hm. Why aren't we prefetching again as soon as possible? Given the > current code there's not really benefit in fetching many adjacent pages > at once. And this way it seems we're somewhat likely to cause fairly > bursty IO? Hmm right, we can always prefetch as soon as we are behind the prefetch target. Done that way. > > > > + /* > > + * In one_page mode it's possible that the undo of the transaction > > + * might have been applied by worker and undo got discarded. Prevent > > + * discard worker from discarding undo data while we are reading it. > > + * See detail comment in UndoFetchRecord. In normal mode we are > > + * holding transaction undo action lock so it can not be discarded. > > + */ > > I don't really see a comment explaining this in UndoFetchRecord. Are > you referring to InHotStandby? Because there's no comment about one_page > mode as far as I can tell? The comment is clearly referring to that, > rather than InHotStandby? I have removed one_page code. > > > > > + if (one_page) > > + { > > + /* Refer comments in UndoFetchRecord. */ > > Missing "to". > > > > + if (InHotStandby) > > + { > > + if (UndoRecPtrIsDiscarded(urecptr)) > > + break; > > + } > > + else > > + { > > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > > + if (slot->logno != logno || urecptr < slot->oldest_data) > > + { > > + /* > > + * The undo log slot has been recycled because it was > > + * entirely discarded, or the data has been discarded > > + * already. > > + */ > > + LWLockRelease(&slot->discard_lock); > > + break; > > + } > > + } > > I find this deeply unsatisfying. It's repeated in a bunch of > places. There's completely different behaviour between the hot-standby > and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, > but we do a different test for !HS. There's no explanation as to why > this is even reachable. I have added comments in UndoFetchRecord. > > > > + /* Read the undo record. */ > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > + > > + /* Release the discard lock after fetching the record. */ > > + if (!InHotStandby) > > + LWLockRelease(&slot->discard_lock); > > + } > > + else > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > And then we do none of this in !one_page mode. UndoBulkFetchRecord is always called from the aborted transaction so its undo can never get discarded concurrently so ideally, we don't need to check for discard. But, during one_page mode, we follow the when it comes from zheap for the one page it is possible that the undo for the transaction are applied from the worker for the complete transaction and its undo logs are discarded. But, I think this is highly am specific so I have removed one_page mode from here. > > > > + /* > > + * As soon as the transaction id is changed we can stop fetching the > > + * undo record. Ideally, to_urecptr should control this but while > > + * reading undo only for a page we don't know what is the end undo > > + * record pointer for the transaction. > > + */ > > + if (one_page) > > + { > > + if (!FullTransactionIdIsValid(fxid)) > > + fxid = uur->uur_fxid; > > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > > + break; > > + } > > + > > + /* Remember the previous undo record pointer. */ > > + prev_urec_ptr = urecptr; > > + > > + /* > > + * Calculate the previous undo record pointer of the transaction. If > > + * we are reading undo only for a page then follow the blkprev chain > > + * of the page. Otherwise, calculate the previous undo record pointer > > + * using transaction's current undo record pointer and the prevlen. If > > + * undo record has a valid uur_prevurp, this is the case of log switch > > + * during the transaction so we can directly use uur_prevurp as our > > + * previous undo record pointer of the transaction. > > + */ > > + if (one_page) > > + urecptr = uur->uur_prevundo; > > + else if (uur->uur_logswitch) > > + urecptr = uur->uur_logswitch->urec_prevurp; > > + else if (prev_urec_ptr == to_urecptr || > > + uur->uur_info & UREC_INFO_TRANSACTION) > > + urecptr = InvalidUndoRecPtr; > > + else > > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > > + > > FWIW, this is one of those concerns I was referring to above. What > exactly needs to happen seems highly AM specific. 1. one_page check is gone 2. uur->uur_info & UREC_INFO_TRANSACTION is also related to one_page so removed this too. 3. else if (uur->uur_logswitch) -> I think this is also related to the incapability of the caller that it can not identify the log switch but expect the bulk fetch to detect it and break fetching so that we can update the progress in the transaction header of the current log. I think we can solve these issue by callback as well as you suggested above. > > > > +/* > > + * Read length of the previous undo record. > > + * > > + * This function will take an undo record pointer as an input and read the > > + * length of the previous undo record which is stored at the end of the previous > > + * undo record. If the undo record is split then this will add the undo block > > + * header size in the total length. > > + */ > > This should add some note as to when it's expected to be necessary. I > was kind of concerned that this can be necessary, but it's only needed > during log switches, which disarms that concern. I think this is a normal case because the undo_len store the actual length of the record. But, if the undo record split across 2 pages and if we are at the end of the undo record (start of the next record) then for computing the absolute start offset of the previous undo record we need the exact distance between these two records and that will be current_offset - (the actual length of the previous record + Undo record header if the previous log is split across 2 pages) > > > > +static uint16 > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > + UndoLogCategory category) > > +{ > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > > + Buffer buffer = input_buffer; > > + Page page = NULL; > > + char *pagedata = NULL; > > + char prevlen[2]; > > + RelFileNode rnode; > > + int byte_to_read = sizeof(uint16); > > Shouldn't it be byte_to_read? And the sizeof a type that's tied with the > actual undo format? Imagine we'd ever want to change the length format > for undo records - this would be hard to find. I did not get this comments. Do you mean that we should not rely on undo format i.e. we should not assume that undo length is stored at the end of the undo record? > > > > + char persistence; > > + uint16 prev_rec_len = 0; > > + > > + /* Get relfilenode. */ > > + UndoRecPtrAssignRelFileNode(rnode, urp); > > + persistence = RelPersistenceForUndoLogCategory(category); > > + > > + if (BufferIsValid(buffer)) > > + { > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + /* > > + * Length if the previous undo record is store at the end of that record > > + * so just fetch last 2 bytes. > > + */ > > + while (byte_to_read > 0) > > + { > > Why does this need a loop around the number of bytes? Can there ever be > a case where this is split across a record? If so, isn't that a bad idea > anyway? Yes, as of now, undo record can be splitted at any point even the undo length can be split acorss 2 pages. I think we can reduce complexity by making sure undo length doesn't get split acorss pages. But for handling that while allocating the undo we need to detect this whether the undo length can get splitted by checking the space in the current page and the undo record length and based on that we need to allocate 1 extra byte in the undo log. Seems that will add an extra complexity. > > > > + /* Read buffer if the current buffer is not valid. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > > + cur_blk, RBM_NORMAL, NULL, > > + persistence); > > + > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + page_offset -= 1; > > + > > + /* > > + * Read current prevlen byte from current block if page_offset hasn't > > + * reach to undo block header. Otherwise, go to the previous block > > + * and continue reading from there. > > + */ > > + if (page_offset >= UndoLogBlockHeaderSize) > > + { > > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > > + byte_to_read -= 1; > > + } > > + else > > + { > > + /* > > + * Release the current buffer if it is not provide by the caller. > > + */ > > + if (input_buffer != buffer) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Could not read complete prevlen from the current block so go to > > + * the previous block and start reading from end of the block. > > + */ > > + cur_blk -= 1; > > + page_offset = BLCKSZ; > > + > > + /* > > + * Reset buffer so that we can read it again for the previous > > + * block. > > + */ > > + buffer = InvalidBuffer; > > + } > > + } > > I can't help but think that this shouldn't be yet another copy of logic > for how to read undo pages. I haven't yet thought but I will try to unify this with ReadUndoBytes. Actually, I didn't do that already because ReadUndoByte needs a start pointer where we need to read the given number of bytes but here we have an end pointer. May be by this logic we can compute the start pointer but that will look equally complex. I will work on this and try to figure out something. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: