Re: XLog changes for 9.3 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: XLog changes for 9.3 |
Date | |
Msg-id | 201206071836.37718.andres@2ndquadrant.com |
In response to | Re: XLog changes for 9.3 (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
Responses | Re: XLog changes for 9.3 |
List | pgsql-hackers |
On Thursday, June 07, 2012 05:35:11 PM Heikki Linnakangas wrote:
> On 07.06.2012 17:18, Andres Freund wrote:
> > On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
> >> 3. Move the only field, xl_rem_len, from the continuation record header
> >> straight to the xlog page header, eliminating XLogContRecord altogether.
> >> This makes it easier to calculate in advance how much space a WAL record
> >> requires, as it no longer depends on how many pages it has to be split
> >> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> >
> > +1. I don't think this will waste a measurable amount in real-world
> > scenarios. A very big percentage of pages have continuation records.
>
> Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
> 64-bit architectures) even when there is a continuation record, because
> of alignment:
>
> typedef struct XLogPageHeaderData
> {
>     uint16      xlp_magic;      /* magic value for correctness checks */
>     uint16      xlp_info;       /* flag bits, see below */
>     TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
>     XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
>
> +   uint32      xlp_rem_len;    /* bytes remaining of continued record */
> } XLogPageHeaderData;
>
> The page header is currently 16 bytes in length, so adding a 4-byte
> field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
> can well live with that.

At that point we can just do the

#define SizeofXLogPageHeaderData (offsetof(XLogPageHeaderData, xlp_rem_len) + sizeof(uint32))

dance. If the record can be smeared over two pages, there is no point in
storing it aligned. Then we don't waste any additional space compared to
the current state.

> > If we do that we can remove all the alignment padding as well. Which would
> > be a problem for you anyway, wouldn't it?
> It's not a problem. You just MAXALIGN the size of the record when you
> calculate how much space it needs, and then all records become naturally
> MAXALIGNed.
> We could quite easily remove the alignment on-disk if we
> wanted to, ReadRecord() already always copies the record to an aligned
> buffer, but I wasn't planning to do that.

What's the reasoning for having alignment on disk if the records aren't
stored contiguously?

> >> These changes will help the XLogInsert scaling patch, by making the
> >> space calculations simpler. In essence, to reserve space for a WAL
> >> record of size X, you just need to do "bytepos += X". There's a lot
> >> more details with that, like mapping from the contiguous byte position
> >> to an XLogRecPtr that takes page headers into account, and noticing
> >> RedoRecPtr changes safely, but it's a start.
> >
> > Hm. Wouldn't you need to remove short/long page headers for that as well?
>
> No, those are ok because they're predictable.

I haven't read your scalability patch, so I am not really sure what you
need... The "bytepos += X" from above isn't as easy that way. But yes, it's
not that complicated.

> Although it would make the
> mapping simpler. To convert from a contiguous xlog byte position that
> excludes all headers, to XLogRecPtr, you need to do something like this
> (I just made this up, probably has bugs, but it's about this complex):
>
> #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
> #define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * \
>     UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))
>
> uint64 xlogrecptr;
> uint64 full_segments = bytepos / UsableBytesInSegment;
> int offset_in_segment = bytepos % UsableBytesInSegment;
>
> xlogrecptr = full_segments * XLOG_SEG_SIZE;
> /* is it on the first page? */
> if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
>     xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
> else
> {
>     /* first page is fully used */
>     xlogrecptr += XLOG_BLCKSZ;
>     /* add other full pages */
>     offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
>     xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
>     /* and finally offset within the last page */
>     xlogrecptr += offset_in_segment % UsableBytesInPage;
> }
> /* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
> XLogRecPtr.xlogid = xlogrecptr >> 32;
> XLogRecPtr.xrecoff = xlogrecptr & 0xffffffff;

It's a bit more complicated than that: records can span a good bit more than
just two pages (even more than two segments), and you need to decide for each
of those pages whether it has a long or a short header.

> Encapsulated in a function, that's not too bad. But if we want to make
> that simpler, one idea would be to allocate the whole 1st page in each
> WAL segment for metadata. That way all the actual xlog pages would hold
> the same amount of xlog data.

It's a bit easier then, but you probably still need to loop over the size and
subtract until you reach the final point. It's no problem to produce a 100MB
WAL record. But then that's probably nothing to design for.

Andres

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
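
The mapping discussed in the quoted fragment can be written out as a standalone C function. This is only a minimal sketch under stated assumptions, not the code from any patch. The page size, segment size and both header sizes are illustrative values chosen here, the function names `sketch_bytepos_to_recptr` and `sketch_reserve` are hypothetical, and the result is a flat 64-bit address rather than an `XLogRecPtr` struct. Compared with the quoted fragment, the sketch also adds `SizeOfXLogShortPHD` when computing the offset within the last page, on the reading that every page after the first in a segment starts with a short header.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values, assumed for this sketch only. */
#define XLOG_BLCKSZ         8192u
#define XLOG_SEG_SIZE       (16u * 1024u * 1024u)
#define SizeOfXLogShortPHD  24u   /* assumed: aligned header incl. xlp_rem_len */
#define SizeOfXLogLongPHD   40u   /* assumed: short header plus 16 extra bytes */

#define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/*
 * Hypothetical helper: convert a contiguous byte position that excludes all
 * page headers into a flat 64-bit WAL address that includes them.
 */
static uint64_t
sketch_bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t full_segments = bytepos / UsableBytesInSegment;
    uint64_t offset = bytepos % UsableBytesInSegment;
    uint64_t recptr = full_segments * XLOG_SEG_SIZE;

    if (offset < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* still on the first page of the segment (long header) */
        recptr += SizeOfXLogLongPHD + offset;
    }
    else
    {
        /* the first page is fully used */
        recptr += XLOG_BLCKSZ;
        offset -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        /* skip the remaining full pages */
        recptr += (offset / UsableBytesInPage) * XLOG_BLCKSZ;
        /* skip the short header of the last page, then the data offset */
        recptr += SizeOfXLogShortPHD + offset % UsableBytesInPage;
    }

    /* sanity check: the result never points into a page header */
    assert(recptr % XLOG_BLCKSZ >= SizeOfXLogShortPHD);
    return recptr;
}

/*
 * Hypothetical helper: the "bytepos += X" reservation step, without any
 * locking. Returns the WAL address where the reserved space starts.
 */
static uint64_t
sketch_reserve(uint64_t *bytepos, uint64_t len)
{
    uint64_t start = sketch_bytepos_to_recptr(*bytepos);

    *bytepos += len;
    return start;
}
```

Because the division by `UsableBytesInPage` and `UsableBytesInSegment` covers any number of pages and segments, applying the same function to `bytepos` and to `bytepos + len` gives both ends of a reservation under these assumptions, however many pages lie in between. Copying the record data would still have to step over each page header along the way.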