Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From: Matthias van de Meent
Subject: Re: problems with making relfilenodes 56-bits
Msg-id: CAEze2Wjd3jY_UhhOGdGGnC6NO=+NmtNOmd=JaYv-v-nwBAiXXA@mail.gmail.com
In response to: Re: problems with making relfilenodes 56-bits (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> > On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I thought about trying to buy back some space elsewhere, and I think
> > > that would be a reasonable approach to getting this committed if we
> > > could find a way to do it. However, I don't see a terribly obvious way
> > > of making it happen.
> >
> > I think there's plenty of potential...
>
> I lightly dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed-width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducibility) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB

I think a significant part of this improvement comes from the premise of
starting with a fresh database. The tablespace OID will indeed most likely
be low, but database OIDs may very well be spread across the whole OID
space if concurrent workloads in the cluster include updating (potentially
unlogged) TOASTed columns and the databases are not created in one "big
bang" but over the lifetime of the cluster. In that case the dbOid will
consume 5 bytes for a significant fraction of databases (anything with
OID >= 2^28).

My point being: I don't think WAL performance should differ between
databases depending on which OID happened to be assigned to that database.

In addition, this varlen encoding of the relfilenode would mean that
performance degrades over time, as a relation's relfilelocator is updated
to something with a wider encoding (through VACUUM FULL or other
relfilelocator cycling, e.g. re-importing a database). For maximum
performance, you would have to tune your cluster to have the lowest
possible database, namespace, and relfilelocator numbers, which (in older
clusters) implies hacking into the catalogs - which seems like an
antipattern.

I would have much less of an issue with this if we had separate counters
per database (and approximately incremental dbOids), but that's not the
case right now.

> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.

That's quite promising.

> There's plenty of other spots that could be converted (e.g. the record length
> which rarely needs four bytes), this is just meant as a demonstration.

Agreed.

> I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
> profile does show pg_varint_decode_uint64(), but partially that seems to be
> offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.
>
> That would be tolerable, I think, because waldump --stats pretty much doesn't
> do anything with the WAL.
>
> But I suspect there's plenty of optimization potential in the varint
> code. Right now it e.g. stores data as big endian, and the bswap instructions
> do show up. And a lot of my bit-maskery could be optimized too.

One thing that comes to mind is that we will never see a dbOid < 2^8 (and
rarely < 2^14; nor an spcOid < 2^8, for that matter), so we'll probably
waste at least one or two bits in the encoding of those values. That's not
the end of the world, but it would probably be better if we could improve
on that - up to 6% of the field's disk usage would be wasted on an
always-set bit.
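To make those boundaries concrete, here's a minimal LEB128-style unsigned
varint encoder. This is a sketch only - it is not the pg_varint prototype
from the quoted patch, which stores data big-endian - but the byte counts
are the same for any 7-payload-bits-per-byte scheme:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode 'value' into 'buf' (>= 10 bytes); returns bytes written. */
    static int
    varint_encode_uint64(uint64_t value, uint8_t *buf)
    {
        int len = 0;

        do
        {
            uint8_t b = value & 0x7F;   /* low 7 payload bits */

            value >>= 7;
            if (value != 0)
                b |= 0x80;              /* continuation bit: more follows */
            buf[len++] = b;
        } while (value != 0);

        return len;
    }

    int
    main(void)
    {
        uint8_t buf[10];

        /* 7 payload bits per byte: OIDs >= 2^28 spill into a 5th byte */
        printf("%d\n", varint_encode_uint64((1U << 28) - 1, buf)); /* 4 */
        printf("%d\n", varint_encode_uint64(1U << 28, buf));       /* 5 */

        /*
         * Since a dbOid is never < 2^8 (and rarely < 2^14), the short
         * encodings are unreachable for that field; biasing the value
         * before encoding could reclaim those wasted bit patterns.
         */
        return 0;
    }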
----

Attached is a prototype patchset that reduces the WAL record size in many
common cases. This is a prototype, as it fails tests due to a locking issue
in prepared_xacts that I have not been able to find the source of yet. It
could also use some more polishing, but the base case seems quite good. I
haven't run the numbers yet, though...

0001 - Extract xl_rminfo from xl_info

See [0] for more info as to why that's useful; the patch was pulled from
there. It is mainly used to reduce the size of 0002, and mostly consists of
a find-and-replace of rmgrs extracting their bits from xl_info.

0002 - Rework XLogRecord

This makes many fields in the xlog header optional, reducing the size of
many xlog records by several bytes. This implements the design I shared in
my earlier message [1]. (A schematic sketch of the general technique
follows at the end of this mail.)

0003 - Rework XLogRecordBlockHeader

This patch could be applied on current HEAD, and saves some bytes in
per-block data. It potentially saves some bytes per registered block/buffer
in the WAL record (at most 2 bytes for the first block, after that up to
3). See the commit message in the patch for detailed information.

Kind regards,

Matthias van de Meent

[0] https://postgr.es/m/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg%40mail.gmail.com
[1] https://postgr.es/m/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com
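PS: for illustration, the "optional header fields" approach of 0002 could
look something like the sketch below at the byte level. The flag bits and
field choices here are invented for this example and do not match the
actual layout in the patch:

    #include <stdint.h>
    #include <string.h>

    #define XLR_HAS_XID      0x01   /* 4-byte TransactionId present */
    #define XLR_HAS_LONG_LEN 0x02   /* 4-byte length instead of 2-byte */

    /* Emit a header into 'out'; absent optional fields cost 0 bytes. */
    static int
    write_mini_header(uint8_t *out, uint8_t flags,
                      uint32_t xid, uint32_t tot_len)
    {
        int off = 0;

        out[off++] = flags;     /* 1 byte says which fields follow */

        if (flags & XLR_HAS_XID)
        {
            memcpy(out + off, &xid, sizeof(xid));
            off += sizeof(xid);
        }

        if (flags & XLR_HAS_LONG_LEN)
        {
            memcpy(out + off, &tot_len, sizeof(tot_len));
            off += sizeof(tot_len);
        }
        else
        {
            uint16_t short_len = (uint16_t) tot_len;

            memcpy(out + off, &short_len, sizeof(short_len));
            off += sizeof(short_len);
        }

        return off;             /* 3, 5, 7, or 9 bytes, not a fixed size */
    }

A record that carries no xid and fits a 2-byte length then pays only 3
bytes of header rather than a full fixed-size struct, which is where the
"several bytes" per record come from.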