Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: WIP: WAL prefetch (another approach) |
Date | |
Msg-id | CA+hUKGKusWunqOd9En+FkT+US_kuDK-MSZyXHCr=OTwRw8Ah1A@mail.gmail.com Whole thread Raw |
In response to | Re: WIP: WAL prefetch (another approach) (Julien Rouhaud <rjuju123@gmail.com>) |
Responses |
Re: WIP: WAL prefetch (another approach)
Re: WIP: WAL prefetch (another approach) Re: WIP: WAL prefetch (another approach) |
List | pgsql-hackers |
On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > I for now went through 0001, TL;DR the patch looks good to me. I have a few > minor comments though, mostly to make things a bit clearer (at least to me). Hi Julien, Thanks for your review of 0001! It gave me a few things to think about and some good improvements. > diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c > index 2340dc247b..c129df44ac 100644 > --- a/src/bin/pg_waldump/pg_waldump.c > +++ b/src/bin/pg_waldump/pg_waldump.c > @@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len) > * add an accessor macro for this. > */ > *fpi_len = 0; > + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++) > { > if (XLogRecHasBlockImage(record, block_id)) > - *fpi_len += record->blocks[block_id].bimg_len; > + *fpi_len += record->record->blocks[block_id].bimg_len; > } > (and similar in that file, xlogutils.c and xlogreader.c) > > This could use XLogRecGetBlock? Note that this macro is for now never used. Yeah, I think that is a good idea for pg_waldump.c and xlogutils.c. Done. > xlogreader.c also has some similar forgotten code that could use > XLogRecMaxBlockId. That is true, but I was thinking of it like this: most of the existing code that interacts with xlogreader.c is working with the old model, where the XLogReader object holds only one "current" record. For that reason the XLogRecXXX() macros continue to work as before, implicitly referring to the record that XLogReadRecord() most recently returned. For xlogreader.c code, I prefer not to use the XLogRecXXX() macros, even when referring to the "current" record, since xlogreader.c has switched to a new multi-record model. In other words, they're sort of 'old API' accessors provided for continuity. Does this make sense? > + * See if we can release the last record that was returned by > + * XLogNextRecord(), to free up space. > + */ > +void > +XLogReleasePreviousRecord(XLogReaderState *state) > > The comment seems a bit misleading, as I first understood it as it could be > optional even if the record exists. Maybe something more like "Release the > last record if any"? Done. > + * Remove it from the decoded record queue. It must be the oldest item > + * decoded, decode_queue_tail. > + */ > + record = state->record; > + Assert(record == state->decode_queue_tail); > + state->record = NULL; > + state->decode_queue_tail = record->next; > > The naming is a bit counter intuitive to me, as before reading the rest of the > code I wasn't expecting the item at the tail of the queue to have a next > element. Maybe just inverting tail and head would make it clearer? Yeah, after mulling this over for a day, I agree. I've flipped it around. Explanation: You're quite right, singly-linked lists traditionally have a 'tail' that points to null, so it makes sense for new items to be added there and older items to be consumed from the 'head' end, as you expected. But... it's also typical (I think?) in ring buffers AKA circular buffers to insert at the 'head', and remove from the 'tail'. This code has both a linked-list (the chain of decoded records with a ->next pointer), and the underlying storage, which is a circular buffer of bytes. I didn't want them to use opposite terminology, and since I started by writing the ring buffer part, that's where I finished up... I agree that it's an improvement to flip them. > +DecodedXLogRecord * > +XLogNextRecord(XLogReaderState *state, char **errormsg) > +{ > [...] > + /* > + * state->EndRecPtr is expected to have been set by the last call to > + * XLogBeginRead() or XLogNextRecord(), and is the location of the > + * error. > + */ > + > + return NULL; > > The comment should refer to XLogFindNextRecord, not XLogNextRecord? No, it does mean to refer to the XLogNextRecord() (ie the last time you called XLogNextRecord and successfully dequeued a record, we put its end LSN there, so if there is a deferred error, that's the corresponding LSN). Make sense? > Also, is it worth an assert (likely at the top of the function) for that? How could I assert that EndRecPtr has the right value? > XLogRecord * > XLogReadRecord(XLogReaderState *state, char **errormsg) > +{ > [...] > + if (decoded) > + { > + /* > + * XLogReadRecord() returns a pointer to the record's header, not the > + * actual decoded record. The caller will access the decoded record > + * through the XLogRecGetXXX() macros, which reach the decoded > + * recorded as xlogreader->record. > + */ > + Assert(state->record == decoded); > + return &decoded->header; > > I find it a bit weird to mention XLogReadRecord() as it's the current function. Changed to "This function ...". > +/* > + * Allocate space for a decoded record. The only member of the returned > + * object that is initialized is the 'oversized' flag, indicating that the > + * decoded record wouldn't fit in the decode buffer and must eventually be > + * freed explicitly. > + * > + * Return NULL if there is no space in the decode buffer and allow_oversized > + * is false, or if memory allocation fails for an oversized buffer. > + */ > +static DecodedXLogRecord * > +XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized) > > Is it worth clearly stating that it's the reponsability of the caller to update > the decode_buffer_head (with the real size) after a successful decoding of this > buffer? Comment added. > + if (unlikely(state->decode_buffer == NULL)) > + { > + if (state->decode_buffer_size == 0) > + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; > + state->decode_buffer = palloc(state->decode_buffer_size); > + state->decode_buffer_head = state->decode_buffer; > + state->decode_buffer_tail = state->decode_buffer; > + state->free_decode_buffer = true; > + } > > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it > here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as > the only caller is the recovery prefetching. I don't think it matters much? > + return decoded; > +} > > I would find it a bit clearer to explicitly return NULL here. Done. > readOff = ReadPageInternal(state, targetPagePtr, > Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ)); > - if (readOff < 0) > + if (readOff == XLREAD_WOULDBLOCK) > + return XLREAD_WOULDBLOCK; > + else if (readOff < 0) > > ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK > possible return value. Yeah. Done. > It's also not particulary obvious why XLogFindNextRecord() doesn't check for > this value. AFAICS callers don't (and should never) call it with a > nonblocking == true state, maybe add an assert for that? Fair point. I have now explicitly cleared that flag. (I don't much like state->nonblocking, which might be better as an argument to page_read(), but in fact I don't like the fact that page_read callbacks are blocking in the first place, which is why I liked Horiguchi-san's patch to get rid of that... but that can be a subject for later work.) > @@ -468,7 +748,7 @@ restart: > if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD) > { > state->overwrittenRecPtr = RecPtr; > - ResetDecoder(state); > + //ResetDecoder(state); > > AFAICS this is indeed not necessary anymore, so it can be removed? Oops, yeah I use C++ comments when there's something I intended to remove. Done. > static void > ResetDecoder(XLogReaderState *state) > { > [...] > + /* Reset the decoded record queue, freeing any oversized records. */ > + while ((r = state->decode_queue_tail)) > > nit: I think it's better to explicitly check for the assignment being != NULL, > and existing code is more frequently written this way AFAICS. I think it's perfectly normal idiomatic C, but if you think it's clearer that way, OK, done like that. > +/* Return values from XLogPageReadCB. */ > +typedef enum XLogPageReadResultResult > > typo Fixed. I realised that this version has broken -DWAL_DEBUG. I'll fix that shortly, but I wanted to post this update ASAP, so here's a new version. The other thing I need to change is that I should turn on recovery_prefetch for platforms that support it (ie Linux and maybe NetBSD only for now), in the tests. Right now you need to put recovery_prefetch=on in a file and then run the tests with "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to excercise much of 0002.
Attachment
pgsql-hackers by date: