Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers
From: Matthias van de Meent
Subject: Re: problems with making relfilenodes 56-bits
Msg-id: CAEze2Wjc42N7vzODXfviuPt3cq-TcbynjouSRMw=LMZmmX7+0A@mail.gmail.com
In response to: Re: problems with making relfilenodes 56-bits (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Mon, 3 Oct 2022, 19:01 Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think it'd be interesting to look at per-record-type stats between two
> > > equivalent workloads, to see where practical workloads suffer the most
> > > (possibly with fpw=off, to make things more repeatable).
> >
> > I would expect, and Dilip's results seem to confirm, the effect to be
> > pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> > That's because most records contain at least one block reference, and
> > if they contain multiple block references, likely all but one will be
> > marked BKPBLOCK_SAME_REL, so we pay the cost just once.
>
> But it doesn't really matter that much if an already large record gets a bit
> bigger. Whereas it does matter if it's a small record. Focussing on optimizing
> the record types where the increase is large seems like a potential way
> forward to me, even if we can't find something generic.
>
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty of potential...
>
> > Trying to do it by optimizing specific WAL record
> > types seems like a real pain in the neck, because there's tons of
> > different WAL records that all have the same problem.
>
> I am not so sure about that. Improving a bunch of the most frequent small
> records might buy you back enough on just about every workload to be OK.
>
> I put the top record sizes for an installcheck run with full_page_writes off
> at the bottom. Certainly our regression tests aren't generally
> representative. But I think it still decently highlights how just improving a
> few records could buy you back more than enough.
>
> > Trying to do it in a generic way makes more sense, and the fact that we have
> > 2 padding bytes available in XLogRecord seems like a place to start looking,
> > but the way forward from there is not clear to me.
>
> Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> only store a 2 byte offset from the page header xl_prev within each record.

With such a small xl_prev we may fail to detect partial page writes in recycled segments, or other issues in the underlying file system. With small record sizes, the chance of returning incorrect data would be significant: approximately the chance of a record boundary landing on the underlying page boundary, times the chance of finding a record of the same MAXALIGN-adjusted size just before the persistence boundary. That issue is part of the reason why my proposed change upthread still contains the full xl_prev.

A different idea is removing most block_ids from the record, and optionally reducing per-block length fields to 1B. Used block IDs are effectively always sequential, and we only allow 33+4 valid values, so we can use 2 bits to distinguish between 'blocks belonging to this ID field have at most 255B of data registered' and 'blocks up to this ID follow sequentially without their own block ID'. That would save 2N-1 total bytes for N blocks. It is scraping the barrel, but I think it is quite possible; a rough sketch of such an encoding is below.
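To illustrate, something along these lines (the flag names, bit layout, and helper are invented for illustration, not what xlogrecord.h defines today):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout: the two high bits of the ID byte are flags,
 * leaving 6 bits for the block ID itself (enough for the 33+4
 * currently valid values).
 */
#define XLR_BLKID_MASK        0x3F  /* low 6 bits: the block ID */
#define XLR_BLKID_SHORT_DATA  0x40  /* block's length field is 1B (<= 255B of data) */
#define XLR_BLKID_SEQUENTIAL  0x80  /* blocks 0..ID follow without their own ID bytes */

static inline uint8_t
blkid_decode(uint8_t idbyte, bool *short_data, bool *sequential)
{
    *short_data = (idbyte & XLR_BLKID_SHORT_DATA) != 0;
    *sequential = (idbyte & XLR_BLKID_SEQUENTIAL) != 0;
    return idbyte & XLR_BLKID_MASK;
}

With N sequentially registered blocks only the last one needs an ID byte (saving N-1 bytes), and each block that registers at most 255B of data can shrink its 2-byte length field to 1 byte (saving up to N more), which is where the 2N-1 comes from.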
Lastly, we could add XLR_BLOCK_ID_DATA_MED for data lengths above 255B and up to UINT16_MAX, with a 2-byte length field instead of XLR_BLOCK_ID_DATA_LONG's 4-byte one. That would save 2 bytes for records that only just pass the 255B barrier, where 2B is still a fairly significant part of the record size.

Kind regards,

Matthias van de Meent