Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: WAL format and API changes (9.5)
Date	April 3, 2014 15:58:34
Msg-id	533D851F.3070608@vmware.com
In response to	Re: WAL format and API changes (9.5) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: WAL format and API changes (9.5)
List	pgsql-hackers

On 04/03/2014 06:37 PM, Tom Lane wrote:
> Also, IIRC there are places that WAL-log full pages that aren't in a
> shared buffer at all (btree build does this I think).  How will that fit
> into this model?

Hmm. We could provide a function for registering a block with given 
content, without a Buffer. Something like:

XLogRegisterPage(int id, RelFileNode, BlockNumber, Page)

>> Let's simplify that, and have one new function, XLogOpenBuffer, which
>> returns a return code that indicates which of the four cases we're
>> dealing with. A typical redo function looks like this:
>
>>     if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>>     {
>>         /* Modify the page */
>>         ...
>
>>         PageSetLSN(page, lsn);
>>         MarkBufferDirty(buffer);
>>     }
>>     if (BufferIsValid(buffer))
>>         UnlockReleaseBuffer(buffer);
>
>> The '0' in the XLogOpenBuffer call is the ID of the block reference
>> specified in the XLogRegisterBuffer call, when the WAL record was created.
>
> +1, but one important step here is finding the data to be replayed.
> That is, a large part of the complexity of replay routines has to do
> with figuring out which parts of the WAL record were elided due to
> full-page-images, and locating the remaining parts.  What can we do
> to make that simpler?

We can certainly add more structure to the WAL records, but any extra 
information you add will make the records larger. It might be worth it, 
and would be lost in the noise for more complex records like page 
splits, but we should keep frequently-used records like heap insertions 
as lean as possible.

> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
> also calculate and hand back the address/size of the logged data that
> had been pointed to by the associated XLogRecData chain item.  The
> trouble here is that there might've been multiple XLogRecData items
> pointing to the same buffer.  Perhaps the magic ID number you give to
> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
> item, not so much a buffer?  It's fairly easy to see what to do when
> there's just one chain item per buffer, but I'm not sure what to do
> if there's more than one.

Hmm. You could register a separate XLogRecData chain for each buffer. 
Along the lines of:

rdata[0].data = data for buffer
rdata[0].len = ...
rdata[0].next = &rdata[1];
rdata[1].data = more data for same buffer
rdata[1].len = ...
rdata[1].next = NULL;

XLogRegisterBuffer(0, buffer, &rdata[0]);

At replay:

if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY)
{
    /* xldata points to the data registered for this buffer */
}

Plus one more chain for the data not associated with a buffer.
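
To make the per-buffer chain idea concrete, here is a rough standalone sketch in plain C. It is a toy model, not PostgreSQL code: every name in it (DemoRecData, DemoRecord, demo_register_buffer, demo_open_buffer, the DEMO_* constants) is made up for illustration. It also ignores full-page images, locking and the buffer manager entirely. It only shows the chain registered under a block ID being flattened at insert time, and the replay side getting back a single pointer and length for that ID.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are PostgreSQL definitions. */
#define DEMO_MAX_BLOCK_REFS 4
#define DEMO_MAX_BLOCK_DATA 256

/* One link of a data chain, modelled loosely on the rdata[] example. */
typedef struct DemoRecData
{
    const char         *data;
    size_t              len;
    struct DemoRecData *next;
} DemoRecData;

typedef enum
{
    DEMO_BLK_REPLAY,        /* data for this block ID is available */
    DEMO_BLK_NOT_FOUND      /* nothing was registered under this ID */
} DemoBlkStatus;

/* A toy "WAL record": one contiguous data area per block reference ID. */
typedef struct DemoRecord
{
    char   data[DEMO_MAX_BLOCK_REFS][DEMO_MAX_BLOCK_DATA];
    size_t len[DEMO_MAX_BLOCK_REFS];
    int    in_use[DEMO_MAX_BLOCK_REFS];
} DemoRecord;

/* Insert side: flatten the whole chain registered for block ID 'id'. */
void
demo_register_buffer(DemoRecord *rec, int id, const DemoRecData *chain)
{
    size_t off = 0;

    assert(id >= 0 && id < DEMO_MAX_BLOCK_REFS);
    for (; chain != NULL; chain = chain->next)
    {
        assert(off + chain->len <= DEMO_MAX_BLOCK_DATA);
        memcpy(rec->data[id] + off, chain->data, chain->len);
        off += chain->len;
    }
    rec->len[id] = off;
    rec->in_use[id] = 1;
}

/* Replay side: hand back address and size of the data for block ID 'id'. */
DemoBlkStatus
demo_open_buffer(const DemoRecord *rec, int id,
                 const char **xldata, size_t *len)
{
    if (id < 0 || id >= DEMO_MAX_BLOCK_REFS || !rec->in_use[id])
        return DEMO_BLK_NOT_FOUND;

    *xldata = rec->data[id];
    *len = rec->len[id];
    return DEMO_BLK_REPLAY;
}
```

With a two-item chain registered under ID 0, demo_open_buffer(&rec, 0, &xldata, &len) returns one contiguous region covering both items, so the redo routine never has to walk the chain itself. The record must be zero-initialized before the first registration.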

- Heikki
