Re: XLog changes for 9.3 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: XLog changes for 9.3
Date
Msg-id 201206071836.37718.andres@2ndquadrant.com
In response to Re: XLog changes for 9.3  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: XLog changes for 9.3
List pgsql-hackers
On Thursday, June 07, 2012 05:35:11 PM Heikki Linnakangas wrote:
> On 07.06.2012 17:18, Andres Freund wrote:
> > On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
> >> 3. Move the only field, xl_rem_len, from the continuation record header
> >> straight to the xlog page header, eliminating XLogContRecord altogether.
> >> This makes it easier to calculate in advance how much space a WAL record
> >> requires, as it no longer depends on how many pages it has to be split
> >> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> > 
> > +1. I don't think this will waste a measurable amount in real-world
> > scenarios. A very large percentage of pages have continuation records.
> 
> Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
> 64-bit architectures) even when there is a continuation record, because
> of alignment:
> 
> typedef struct XLogPageHeaderData
> {
>      uint16      xlp_magic;     /* magic value for correctness checks */
>      uint16      xlp_info;      /* flag bits, see below */
>      TimeLineID  xlp_tli;       /* TimeLineID of first record on page */
>      XLogRecPtr  xlp_pageaddr;  /* XLOG address of this page */
> 
> +   uint32      xlp_rem_len;   /* bytes remaining of continued record */
>   } XLogPageHeaderData;
> 
> The page header is currently 16 bytes in length, so adding a 4-byte
> field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
> can well live with that.
At that point we can just do the
#define SizeofXLogPageHeaderData (offsetof(XLogPageHeaderData, xlp_rem_len) + sizeof(uint32))
dance. If the record can be smeared over two pages there is no point in 
storing it aligned. Then we don't waste any additional space in comparison to 
the current state.
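
A minimal sketch of that size calculation, for illustration only. The typedefs are stand-ins for the real PostgreSQL definitions, the 8-byte MAXALIGN is an assumption about a typical 64-bit platform, and the two helper functions are hypothetical names used just to expose the values:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; the real ones live in the PostgreSQL headers. */
typedef uint32_t TimeLineID;
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

typedef struct XLogPageHeaderData
{
    uint16_t    xlp_magic;     /* magic value for correctness checks */
    uint16_t    xlp_info;      /* flag bits */
    TimeLineID  xlp_tli;       /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;  /* XLOG address of this page */
    uint32_t    xlp_rem_len;   /* bytes remaining of continued record */
} XLogPageHeaderData;

/* Assumed maximum alignment of 8 bytes (typical 64-bit platform). */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
    (((size_t) (len) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Header size measured to the end of the last field, without padding. */
#define UNPADDED_PAGE_HEADER_SIZE \
    (offsetof(XLogPageHeaderData, xlp_rem_len) + sizeof(uint32_t))

/* Hypothetical helpers that return the two candidate header sizes. */
size_t unpadded_header_size(void) { return UNPADDED_PAGE_HEADER_SIZE; }
size_t aligned_header_size(void)  { return MAXALIGN(sizeof(XLogPageHeaderData)); }
```

Under these assumptions the unpadded size is 20 bytes and the MAXALIGNed size is 24 bytes, so the offsetof-based definition avoids the 4 bytes of padding.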

> > If we do that we can remove all the alignment padding as well. Which would
> > be a problem for you anyway, wouldn't it?
> It's not a problem. You just MAXALIGN the size of the record when you
> calculate how much space it needs, and then all records become naturally
> MAXALIGNed. We could quite easily remove the alignment on-disk if we
> wanted to, ReadRecord() already always copies the record to an aligned
> buffer, but I wasn't planning to do that.
What's the reasoning for having alignment on disk if the records aren't stored 
contiguously?
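
For reference, a rough sketch of the "MAXALIGN the size when reserving space" idea from the quoted text. The function name reserve_record and the 8-byte alignment are assumptions for illustration, not taken from any patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed maximum alignment of 8 bytes (typical 64-bit platform). */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
    (((uint64_t) (len) + (MAXIMUM_ALIGNOF - 1)) & ~((uint64_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Hypothetical reservation helper: returns the byte position where the
 * record starts and advances *bytepos by the MAXALIGNed record length.
 * If *bytepos starts out aligned, every record start stays aligned.
 */
uint64_t
reserve_record(uint64_t *bytepos, uint32_t rec_len)
{
    uint64_t start = *bytepos;

    *bytepos += MAXALIGN(rec_len);
    return start;
}
```

Starting from position 0, a 29-byte record would occupy 32 bytes of the byte-position space, so the next record starts at 32.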

> >> These changes will help the XLogInsert scaling patch, by making the
> >> space calculations simpler. In essence, to reserve space for a WAL
> >> record of size X, you just need to do "bytepos += X".  There's a lot
> >> more details with that, like mapping from the contiguous byte position
> >> to an XLogRecPtr that takes page headers into account, and noticing
> >> RedoRecPtr changes safely, but it's a start.
> > 
> > Hm. Wouldn't you need to remove short/long page headers for that as well?
> 
> No, those are ok because they're predictable.
I haven't read your scalability patch, so I am not really sure what you 
need...
The "bytepos += X" from above isn't as easy that way. But yes, it's not that 
complicated.

> Although it would make the
> mapping simpler. To convert from a contiguous xlog byte position that
> excludes all headers, to XLogRecPtr, you need to do something like this
> (I just made this up, probably has bugs, but it's about this complex):
> 
> #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
> #define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * \
>     UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))
> 
> uint64 xlogrecptr;
> uint64 full_segments = bytepos / UsableBytesInSegment;
> int offset_in_segment = bytepos % UsableBytesInSegment;
> 
> xlogrecptr = full_segments * XLOG_SEG_SIZE;
> /* is it on the first page? */
> if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
>     xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
> else
> {
>     /* first page is fully used */
>     xlogrecptr += XLOG_BLCKSZ;
>     /* add other full pages */
>     offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
>     xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
>     /* and finally the short header plus the offset within the last page */
>     xlogrecptr += SizeOfXLogShortPHD + offset_in_segment % UsableBytesInPage;
> }
> /* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
> XLogRecPtr recptr;
> recptr.xlogid = xlogrecptr >> 32;
> recptr.xrecoff = xlogrecptr & 0xffffffff;
It's a bit more complicated than that: records can span a good bit more than 
just two pages (even more than two segments), and you need to decide for each 
of those pages whether it has a long or a short header.
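
As a rough, self-contained sketch of the position mapping quoted above (not the actual patch): the block size, segment size and both header sizes below are assumed values chosen for illustration, and the function name is hypothetical. It maps a single byte position; the last step adds the short page header of the final page.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define XLOG_BLCKSZ         UINT64_C(8192)
#define XLOG_SEG_SIZE       (UINT64_C(16) * 1024 * 1024)
#define SizeOfXLogShortPHD  UINT64_C(24)
#define SizeOfXLogLongPHD   UINT64_C(40)

#define UsableBytesInPage    (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
                              (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/*
 * Map a header-free byte position to a flat 64-bit WAL address.
 * The first page of each segment carries the long header, every
 * other page the short one.
 */
uint64_t
bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t full_segments = bytepos / UsableBytesInSegment;
    uint64_t offset = bytepos % UsableBytesInSegment;
    uint64_t recptr = full_segments * XLOG_SEG_SIZE;

    /* Does the position fall on the first page of the segment? */
    if (offset < XLOG_BLCKSZ - SizeOfXLogLongPHD)
        return recptr + SizeOfXLogLongPHD + offset;

    /* First page is fully used; skip it. */
    recptr += XLOG_BLCKSZ;
    offset -= XLOG_BLCKSZ - SizeOfXLogLongPHD;

    /* Skip the remaining full pages. */
    recptr += (offset / UsableBytesInPage) * XLOG_BLCKSZ;

    /* Short header of the final page, then the offset within it. */
    recptr += SizeOfXLogShortPHD + offset % UsableBytesInPage;
    return recptr;
}
```

With these assumed sizes, position 0 maps to the first byte after the long header, and the first position that no longer fits on the first page maps to just after the short header of the second page.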

> Encapsulated in a function, that's not too bad. But if we want to make
> that simpler, one idea would be to allocate the whole 1st page in each
> WAL segment for metadata. That way all the actual xlog pages would hold
> the same amount of xlog data.
It's a bit easier then, but you probably still need to loop over the size and 
subtract until you reach the final point. It's no problem to produce a 100MB 
WAL record. But then that's probably nothing to design for.
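
A sketch of what the mapping might look like under the "whole first page is metadata" idea, purely as an assumption-laden illustration: the sizes are assumed, the function name is hypothetical, and every data page is taken to carry only the short header. For a single position this reduces to division and modulo:

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define XLOG_BLCKSZ         UINT64_C(8192)
#define XLOG_SEG_SIZE       (UINT64_C(16) * 1024 * 1024)
#define SizeOfXLogShortPHD  UINT64_C(24)

#define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)
/* Page 0 of each segment is reserved for metadata in this variant. */
#define DataPagesPerSegment (XLOG_SEG_SIZE / XLOG_BLCKSZ - 1)

/*
 * Map a header-free byte position to a flat 64-bit WAL address when
 * all data pages hold the same amount of WAL data.
 */
uint64_t
bytepos_to_recptr_uniform(uint64_t bytepos)
{
    uint64_t page = bytepos / UsableBytesInPage;      /* data page index */
    uint64_t offset = bytepos % UsableBytesInPage;    /* offset in page data */
    uint64_t segment = page / DataPagesPerSegment;
    uint64_t page_in_segment = page % DataPagesPerSegment + 1; /* skip page 0 */

    return segment * XLOG_SEG_SIZE
        + page_in_segment * XLOG_BLCKSZ
        + SizeOfXLogShortPHD
        + offset;
}
```

The cost of this layout is one full page of each segment that holds no WAL data.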

Andres
-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

