Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader
Date
Msg-id 201209171107.28912.andres@2ndquadrant.com
In response to Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader  (Andres Freund <andres@2ndquadrant.com>)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
> On 17.09.2012 11:12, Andres Freund wrote:
> > On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
> >> On 15.09.2012 03:39, Andres Freund wrote:
> >> 2. We should focus on reading WAL, I don't see the point of mixing WAL
> >> writing into this.
> > 
> > If you write something that filters/analyzes and then forwards WAL and
> > you want to do that without a big overhead (i.e. completely reassembling
> > everything, and then deassembling it again for writeout) it's hard to do
> > that without integrating both sides.
> 
> It seems really complicated to filter/analyze WAL records without
> reassembling them, anyway. The user of the facility is in charge of
> reading the physical data, so you can still access the raw data, for
> forwarding purposes, in addition to the reassembled records.
It works ;)

> Or what exactly do you mean by "completely deassembling"? I read that to
> mean dealing with page boundaries, ie. if a record is split across
> pages, copy parts into a contiguous temporary buffer.
Well, if you want to fully split reading and writing of records - which is a 
nice goal! - you basically need the full logic of XLogInsert again to split 
them apart again to write them. Alternatively you need to store record 
boundaries somewhere and copy that way, but in the end if you filter you need 
to correct CRCs...

> > Also, I want to read records incrementally/partially just as data comes
> > in which again is hard to combine with writing out the data again.
> 
> You mean, you want to start reading the first half of a record, before
> the 2nd half is available? That seems complicated.
Well, I can only say again: it works ;). It makes it easy to follow something
like XLogwrtResult without having to care about record boundaries.

> I'd suggest keeping it simple for now, and optimize later if necessary.
Well, yes. The API should be able to comfortably support those cases, though,
which I don't think is necessarily the case in a simple, one-call API as
proposed.

> Note that before you have the whole WAL record, you cannot CRC check it, so
> you don't know if it's in fact a valid WAL record.
Sure. But you can start the CRC computation without any problems and finish it 
when the last part of the data comes in.
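
A minimal sketch of that idea, assuming a generic bitwise CRC-32 (reflected
polynomial 0xEDB88320) rather than PostgreSQL's own CRC macros. The names
crc32_state, crc32_init, crc32_update and crc32_final are hypothetical and
exist only for this illustration. The point is that the running value can be
carried across arbitrarily split chunks and finalized once the last part of
the record has arrived:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running CRC-32 state; hypothetical type for this sketch only. */
typedef struct
{
    uint32_t value;
} crc32_state;

/* Start a new computation. */
void
crc32_init(crc32_state *s)
{
    assert(s != NULL);
    s->value = 0xFFFFFFFFu;
}

/*
 * Fold another chunk into the running CRC. The chunk boundaries are
 * irrelevant, so a record may arrive in any number of pieces.
 */
void
crc32_update(crc32_state *s, const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc;

    assert(s != NULL);
    crc = s->value;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    s->value = crc;
}

/* Produce the final checksum once all data has been seen. */
uint32_t
crc32_final(const crc32_state *s)
{
    assert(s != NULL);
    return s->value ^ 0xFFFFFFFFu;
}
```

Feeding "1234" and then "56789" yields the same checksum as feeding
"123456789" in one call, so validation can simply be deferred until the
record is complete.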

> >> I came up with the attached. I moved ReadRecord and some supporting
> >> functions from xlog.c to xlogreader.c, and made it operate on
> >> XLogReaderState instead of global variables. As discussed before,
> >> I didn't like the callback-style API, I think the consumer of the API
> >> should rather just call ReadRecord repeatedly to get each record. So
> >> that's what I did.
> > 
> > The problem with that kind of API is that, at least as far as I can
> > see, it can never operate on incomplete/partial input. You need to
> > buffer larger amounts of xlog somewhere and you need to be aware of
> > record boundaries. Both are things I dislike in a more generic user
> > than xlog.c.
> 
> I don't understand that argument. A typical large WAL record is split
> across 1-2 pages, maybe 3-4 at most, for an index page split record.
> That doesn't feel like much to me. In extreme cases, a WAL record can be
> much larger (e.g. a commit record of a transaction with a huge number of
> subtransactions), but that should be rare in practice.
Well, imagine something like the walsender that essentially follows the flush 
position ideally without regard for record boundaries. It is nice to be able to 
send/analyze/filter as soon as possible without waiting till a page is full. 
And it sure would be nice to be able to read the data on the other side 
directly from the network, decompress it again, and only then store it to disk.

> The user of the facility doesn't need to be aware of record boundaries,
> that's the responsibility of the facility. I thought that's exactly the
> point of generalizing this thing, to make it unnecessary for the code
> that uses it to be aware of such things.
With the proposed API it seems pretty much a requirement to wait inside the
callback. That's not really nice if your process has other things to wait for
as well.

In my proposal you can simply do something like:

XLogReaderRead(state);

DoSomeOtherWork();

if (CheckForMessagesFromWalreceiver())
    ProcessMessages();
else if (state->needs_input)
    UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
    UseSelectOnOutputSocket();

but you can also do something like waiting on a Latch but *also* on other fds. 
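
To make the "reader never blocks, caller decides where to wait" idea concrete,
here is a toy sketch. It is not the XLogReader code from the patch: the record
format (one length byte followed by the payload) and all names (toy_reader,
toy_reader_feed, TOY_NEEDS_INPUT and so on) are assumptions made up for this
illustration. The reader consumes whatever bytes are available, reports
whether it needs more input, and leaves the waiting to the caller's own loop:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    TOY_NEEDS_INPUT,            /* record incomplete, feed more bytes */
    TOY_RECORD_READY            /* a full record is available */
} toy_status;

/* Hypothetical reader state: one length byte plus up to 255 payload bytes. */
typedef struct
{
    unsigned char buf[1 + 255];
    size_t      have;           /* bytes collected for the current record */
} toy_reader;

void
toy_reader_init(toy_reader *r)
{
    assert(r != NULL);
    r->have = 0;
}

/*
 * Consume bytes from *data until a record is complete or the input runs
 * out. Advances *data and *len past the consumed bytes; never blocks.
 */
toy_status
toy_reader_feed(toy_reader *r, const unsigned char **data, size_t *len)
{
    assert(r != NULL && data != NULL && len != NULL);

    for (;;)
    {
        /* Until the length byte has arrived we only want that one byte. */
        size_t      want = (r->have == 0) ? 1 : 1 + (size_t) r->buf[0];
        size_t      n;

        if (r->have > 0 && r->have == want)
            return TOY_RECORD_READY;
        if (*len == 0)
            return TOY_NEEDS_INPUT;

        n = want - r->have;
        if (n > *len)
            n = *len;
        memcpy(r->buf + r->have, *data, n);
        r->have += n;
        *data += n;
        *len -= n;
    }
}

/* Accessors for a completed record. */
size_t
toy_record_len(const toy_reader *r)
{
    return r->buf[0];
}

const unsigned char *
toy_record_data(const toy_reader *r)
{
    return r->buf + 1;
}

/* Discard the current record so the next one can be collected. */
void
toy_reader_next(toy_reader *r)
{
    r->have = 0;
}
```

A caller would feed each chunk as it arrives from the socket, handle every
record for which TOY_RECORD_READY is returned, and go back to its own
latch/select wait whenever TOY_NEEDS_INPUT comes back, regardless of where the
chunk boundaries fall relative to the records.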

> > If you don't want the capability to forward/filter the data and read
> > partial data without regard for record constraints/buffering your patch
> > seems to be quite a good start. It misses xlogreader.h though...
> 
> Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.

Greetings,

Andres
-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


