Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | 20190805182934.cgmxmxnjhm3ohhhf@alap3.anarazel.de Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
Re: POC: Cleaning up orphaned files using undo logs |
List | pgsql-hackers |
Hi, (as I was out of context due to dealing with bugs, I've switched to lookign at the current zheap/undoprocessing branch. On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > +/* > + * Insert a previously-prepared undo records. > + * > + * This function will write the actual undo record into the buffers which are > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > + * step should be performed inside a critical section. > + */ Again, I think it's not ok to just assume you can lock an essentially unbounded number of buffers. This seems almost guaranteed to result in deadlocks. And there's limits on how many lwlocks one can hold etc. As far as I can tell there's simply no deadlock avoidance scheme in use here *at all*? I must be missing something. > + /* Main loop for writing the undo record. */ > + do > + { I'd prefer this to not be a do{} while(true) loop - as written I need to read to the end to see what the condition is. I don't think we have any loops like that in the code. > + /* > + * During recovery, there might be some blocks which are already > + * deleted due to some discard command so we can just skip > + * inserting into those blocks. > + */ > + if (!BufferIsValid(buffer)) > + { > + Assert(InRecovery); > + > + /* > + * Skip actual writing just update the context so that we have > + * write offset for inserting into next blocks. > + */ > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + break; > + } How exactly can this happen? > + else > + { > + page = BufferGetPage(buffer); > + > + /* > + * Initialize the page whenever we try to write the first > + * record in page. We start writing immediately after the > + * block header. > + */ > + if (starting_byte == UndoLogBlockHeaderSize) > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > + ucontext.already_processed, > + prepared_undo->urec->uur_tuple.len, > + prepared_undo->urec->uur_payload.len); > + > + /* > + * Try to insert the record into the current page. If it > + * doesn't succeed then recall the routine with the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; At this point we're five indentation levels deep. I'd extract at least either the the per prepared undo code or the code performing the writing across block boundaries into a separate function. Perhaps both. > +/* > + * Helper function for UndoGetOneRecord > + * > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > + * it will get the information from the first complete undo record in the > + * page. > + */ > +static void > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > +{ > + /* > + * If any of the common header field is not available in the current undo > + * record then we must read it from the first complete record of the page. > + */ How is it guaranteed that the first record on the page is actually from the current transaction? Can't there be a situation where that's from another transaction? > +/* > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > + * > + * curbuf - If an input buffer is valid then this function will not release the > + * pin on that buffer. If the buffer is not valid then it will assign curbuf > + * with the first buffer of the current undo record and also it will keep the > + * pin and lock on that buffer in a hope that while traversing the undo chain > + * the caller might want to read the previous undo record from the same block. > + */ Wait, so at exit *curbuf is pinned but not locked, if passed in, but is pinned *and* locked when not? That'd not be a sane API. I don't think the code works like that atm though. > +static UnpackedUndoRecord * > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > + UndoLogCategory category, Buffer *curbuf) > +{ > + Page page; > + int starting_byte = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk; > + UndoPackContext ucontext = {{0}}; > + Buffer buffer = *curbuf; > + > + cur_blk = UndoRecPtrGetBlockNum(urp); > + > + /* Initiate unpacking one undo record. */ > + BeginUnpackUndo(&ucontext); > + > + while (true) > + { > + /* If we already have a buffer then no need to allocate a new one. */ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > + RBM_NORMAL, NULL, > + RelPersistenceForUndoLogCategory(category)); > + > + /* > + * Remember the first buffer where this undo started as next undo > + * record what we fetch might fall on the same buffer. > + */ > + if (!BufferIsValid(*curbuf)) > + *curbuf = buffer; > + } > + > + /* Acquire shared lock on the buffer before reading undo from it. */ > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + > + UnpackUndoData(&ucontext, page, starting_byte); > + > + /* > + * We are done if we have reached to the done stage otherwise move to > + * next block and continue reading from there. > + */ > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Get any of the missing fields from the first record of the > + * page. > + */ > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > + break; > + } > + > + /* > + * The record spans more than a page so we would have copied it (see > + * UnpackUndoRecord). In such cases, we can release the buffer. > + */ Where would it have been copied? Presumably in UnpackUndoData()? Imo the comment should say so. I'm a bit confused by the use of "would" in that comment. Either we have, or not? > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); Wait, so we *keep* the buffer locked if it the same as *curbuf? That can't be right. > + * Fetch the undo record for given undo record pointer. > + * > + * This will internally allocate the memory for the unpacked undo record which > + * intern will "intern" should probably be internally? But I'm not sure what the two "internally"s really add here. > +/* > + * Release the memory of the undo record allocated by UndoFetchRecord and > + * UndoBulkFetchRecord. > + */ > +void > +UndoRecordRelease(UnpackedUndoRecord *urec) > +{ > + /* Release the memory of payload data if we allocated it. */ > + if (urec->uur_payload.data) > + pfree(urec->uur_payload.data); > + > + /* Release memory of tuple data if we allocated it. */ > + if (urec->uur_tuple.data) > + pfree(urec->uur_tuple.data); > + > + /* Release memory of the transaction header if we allocated it. */ > + if (urec->uur_txn) > + pfree(urec->uur_txn); > + > + /* Release memory of the logswitch header if we allocated it. */ > + if (urec->uur_logswitch) > + pfree(urec->uur_logswitch); > + > + /* Release the memory of the undo record. */ > + pfree(urec); > +} Those comments before each pfree are not useful. Also, isn't this both fairly slow and fairly failure prone? The next record is going to need all that memory again, no? It seems to me that there should be one record that's allocated once, and then reused over multiple fetches, increasing the size if necesssary. I'm very doubtful that all this freeing of individual allocations in the undo code makes sense. Shouldn't this just be done in short lived memory contexts, that then get reset as a whole? That's both far less failure prone, and faster. > + * one_page - Caller is applying undo only for one block not for > + * complete transaction. If this is set true then instead > + * of following transaction undo chain using prevlen we will > + * follow the block prev chain of the block so that we can > + * avoid reading many unnecessary undo records of the > + * transaction. > + */ > +UndoRecInfo * > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > + int undo_apply_size, int *nrecords, bool one_page) There's no caller for one_page mode in the series - I assume that's for later, during page-wise undo? It seems to behave in quite noticably different ways, is that really OK? Makes the code quite hard to understand. Also, it seems quite poorly named to me. It sounds like it's about fetching a single undo page (which makes no sense, obviously). But what it does is to switch to an entirely different way of traversing the undo chains. > + /* > + * In one_page mode we are fetching undo only for one page instead of > + * fetching all the undo of the transaction. Basically, we are fetching > + * interleaved undo records. So it does not make sense to do any prefetch > + * in that case. What does "interleaved" mean here? I assume that there will often be other UNDO records interspersed? But that's not guaranteed at all, right? In fact, for a lot of workloads it seems likely that there will be many consecutive undo records for a single page? In fact, won't that be the majority of cases? Thus it's not obvious to me that there's not often going to be consecutive pages for this case too. I'd even say that minimizing IO delay is *MORE* important during page-wise undo, as that happens in the context of client accesses, and it's not incurring cost on the party that performed DML, but on some random third party. I'm doubtful this is a sane interface. There's a lot of duplication between one_page and not one_page. It presupposes specific ways of constructing chains that are likely to depend on the AM. to_urecptr is only used in certain situations. E.g. I strongly suspect that for zheap's visibility determinations we'd want to concurrently follow all the necessary chains to determine visibility for all all tuples on the page, far enough to find the visible tuple - for seqscan's / bitmap heap scans / everything using page mode scans, that'll be way more efficient than doing this one-by-one and possibly even repeatedly. But what is exactly the right thing to do is going to be highly AM specific. I vaguely suspect what you'd want is an interface where the "bulk fetch" context basically has a FIFO queue of undo records to fetch, and a function to actually perform fetching. Whenever a record has been retrieved, a callback determines whether additional records are needed. In the case of fetching all the undo for a transaction, you'd just queue - probably in a more efficient representation - all the necessary undo. In case of page-wise undo, you'd queue the first record of the chain you'd want to undo, with a callback for queuing the next record. For visibility determinations in zheap, you'd queue all the different necessary chains, with a callback that queues the next necessary record if still needed for visibility determination. And then I suspect you'd have a separate callback whenever records have been fetched, with all the 'unconsumed' records. That then can, e.g. based on memory consumption, decide to process them or not. For visibility information you'd probably just want to condense the records to the minimum necessary (i.e. visibility information for the relevant tuples, and the visibile tuple when encountered) as soon as available. Obviously that's pretty handwavy. > Also, if we are fetching undo records from more than one > + * log, we don't know the boundaries for prefetching. Hence, we can't use > + * prefetching in this case. > + */ Hm. Why don't we know the boundaries (or cheaply infer them)? > + /* > + * If prefetch_pages are half of the prefetch_target then it's time to > + * prefetch again. > + */ > + if (prefetch_pages < prefetch_target / 2) > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > + from_blkno, category); Hm. Why aren't we prefetching again as soon as possible? Given the current code there's not really benefit in fetching many adjacent pages at once. And this way it seems we're somewhat likely to cause fairly bursty IO? > + /* > + * In one_page mode it's possible that the undo of the transaction > + * might have been applied by worker and undo got discarded. Prevent > + * discard worker from discarding undo data while we are reading it. > + * See detail comment in UndoFetchRecord. In normal mode we are > + * holding transaction undo action lock so it can not be discarded. > + */ I don't really see a comment explaining this in UndoFetchRecord. Are you referring to InHotStandby? Because there's no comment about one_page mode as far as I can tell? The comment is clearly referring to that, rather than InHotStandby? > + if (one_page) > + { > + /* Refer comments in UndoFetchRecord. */ Missing "to". > + if (InHotStandby) > + { > + if (UndoRecPtrIsDiscarded(urecptr)) > + break; > + } > + else > + { > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > + if (slot->logno != logno || urecptr < slot->oldest_data) > + { > + /* > + * The undo log slot has been recycled because it was > + * entirely discarded, or the data has been discarded > + * already. > + */ > + LWLockRelease(&slot->discard_lock); > + break; > + } > + } I find this deeply unsatisfying. It's repeated in a bunch of places. There's completely different behaviour between the hot-standby and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, but we do a different test for !HS. There's no explanation as to why this is even reachable. > + /* Read the undo record. */ > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > + > + /* Release the discard lock after fetching the record. */ > + if (!InHotStandby) > + LWLockRelease(&slot->discard_lock); > + } > + else > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); And then we do none of this in !one_page mode. > + /* > + * As soon as the transaction id is changed we can stop fetching the > + * undo record. Ideally, to_urecptr should control this but while > + * reading undo only for a page we don't know what is the end undo > + * record pointer for the transaction. > + */ > + if (one_page) > + { > + if (!FullTransactionIdIsValid(fxid)) > + fxid = uur->uur_fxid; > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > + break; > + } > + > + /* Remember the previous undo record pointer. */ > + prev_urec_ptr = urecptr; > + > + /* > + * Calculate the previous undo record pointer of the transaction. If > + * we are reading undo only for a page then follow the blkprev chain > + * of the page. Otherwise, calculate the previous undo record pointer > + * using transaction's current undo record pointer and the prevlen. If > + * undo record has a valid uur_prevurp, this is the case of log switch > + * during the transaction so we can directly use uur_prevurp as our > + * previous undo record pointer of the transaction. > + */ > + if (one_page) > + urecptr = uur->uur_prevundo; > + else if (uur->uur_logswitch) > + urecptr = uur->uur_logswitch->urec_prevurp; > + else if (prev_urec_ptr == to_urecptr || > + uur->uur_info & UREC_INFO_TRANSACTION) > + urecptr = InvalidUndoRecPtr; > + else > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > + FWIW, this is one of those concerns I was referring to above. What exactly needs to happen seems highly AM specific. > +/* > + * Read length of the previous undo record. > + * > + * This function will take an undo record pointer as an input and read the > + * length of the previous undo record which is stored at the end of the previous > + * undo record. If the undo record is split then this will add the undo block > + * header size in the total length. > + */ This should add some note as to when it's expected to be necessary. I was kind of concerned that this can be necessary, but it's only needed during log switches, which disarms that concern. > +static uint16 > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoLogCategory category) > +{ > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > + Buffer buffer = input_buffer; > + Page page = NULL; > + char *pagedata = NULL; > + char prevlen[2]; > + RelFileNode rnode; > + int byte_to_read = sizeof(uint16); Shouldn't it be byte_to_read? And the sizeof a type that's tied with the actual undo format? Imagine we'd ever want to change the length format for undo records - this would be hard to find. > + char persistence; > + uint16 prev_rec_len = 0; > + > + /* Get relfilenode. */ > + UndoRecPtrAssignRelFileNode(rnode, urp); > + persistence = RelPersistenceForUndoLogCategory(category); > + > + if (BufferIsValid(buffer)) > + { > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + /* > + * Length if the previous undo record is store at the end of that record > + * so just fetch last 2 bytes. > + */ > + while (byte_to_read > 0) > + { Why does this need a loop around the number of bytes? Can there ever be a case where this is split across a record? If so, isn't that a bad idea anyway? > + /* Read buffer if the current buffer is not valid. */ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > + cur_blk, RBM_NORMAL, NULL, > + persistence); > + > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + page_offset -= 1; > + > + /* > + * Read current prevlen byte from current block if page_offset hasn't > + * reach to undo block header. Otherwise, go to the previous block > + * and continue reading from there. > + */ > + if (page_offset >= UndoLogBlockHeaderSize) > + { > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > + byte_to_read -= 1; > + } > + else > + { > + /* > + * Release the current buffer if it is not provide by the caller. > + */ > + if (input_buffer != buffer) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Could not read complete prevlen from the current block so go to > + * the previous block and start reading from end of the block. > + */ > + cur_blk -= 1; > + page_offset = BLCKSZ; > + > + /* > + * Reset buffer so that we can read it again for the previous > + * block. > + */ > + buffer = InvalidBuffer; > + } > + } I can't help but think that this shouldn't be yet another copy of logic for how to read undo pages. Need to do something else for a bit. More later. Greetings, Andres Freund
pgsql-hackers by date: