problems with making relfilenodes 56-bits - Mailing list pgsql-hackers
| From | Robert Haas | 
|---|---|
| Subject | problems with making relfilenodes 56-bits | 
| Date | |
| Msg-id | CA+Tgmoaa9Yc9O-FP4vS_xTKf8Wgy8TzHpjnjN56_ShKE=jrP-Q@mail.gmail.com | 
| Responses | Re: problems with making relfilenodes 56-bits Re: problems with making relfilenodes 56-bits Re: problems with making relfilenodes 56-bits | 
| List | pgsql-hackers | 
OK, so the recent commit and revert of the 56-bit relfilenode patch revealed a few issues that IMHO need design-level input. Let me try to surface those here, starting a new thread to separate this discussion from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord() depend on every subsidiary structure that can be added to a commit or abort record requiring exactly 4-byte alignment. IMHO, this seems awfully fragile, even leaving the 56-bit patch aside. Prepare records seem to have a much saner scheme: they've also got a bunch of different things that can be stuck onto the main record, but they maxalign each top-level thing that they stick in there. So ParsePrepareRecord() doesn't have to make any icky alignment assumptions the way ParseCommitRecord() and ParseAbortRecord() do.

Unfortunately, that scheme doesn't work as well for commit records, because the very first top-level thing only needs 2 bytes. We're currently using 4, and it would obviously be nicer to cut that down to 2 than to have it go up to 8. We could try to rejigger things around somehow to avoid needing that 2-byte quantity in there as a separate top-level item, but I'm not quite sure how to do that, or we could just copy everything to ensure alignment, but that seems kind of expensive.

If we don't decide to do either of those things, we should at least better document, and preferably enforce via asserts, the requirement that these structs be exactly 4-byte aligned, so that nobody else makes the same mistake in the future.

2. WAL Size. Block references in the WAL are by RelFileLocator, so if you make RelFileLocators bigger, WAL gets bigger. We'd have to test the exact impact of this, but it seems a bit scary: if you have a WAL stream with few FPIs doing DML on a narrow table, probably most records will contain 1 block reference (and occasionally more, but I guess most will use BKPBLOCK_SAME_REL), and adding 4 bytes to that block reference feels like it might add up to something significant.

I don't really see any way around this, either: if you make relfilenode values wider, they take up more space. Perhaps there's a way to claw that back elsewhere, or we could do something really crazy like switch to variable-width representations of integer quantities in WAL records, but there doesn't seem to be any simple way forward other than, you know, deciding that we're willing to pay the cost of the additional WAL volume.

3. Sinval Message Size. Sinval messages are 16 bytes right now. They'll have to grow to 20 bytes if we do this. There's even less room for bit-squeezing here than there is for the WAL stuff. I'm skeptical that this really matters, but Tom seems concerned.

4. Other Uses of RelFileLocator. There are a bunch of structs I haven't looked into yet that also embed RelFileLocator, which may have their own issues with alignment, padding, and/or size: ginxlogSplit, ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse, xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData, xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a bunch of these are things that get written into WAL, but at least some of them seem like they probably don't get written into WAL often enough to matter. Needs more investigation, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
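
To illustrate the "enforce via asserts" idea from point 1, here is a minimal sketch of a compile-time check that a sub-structure is exactly 4-byte aligned and a multiple of 4 bytes long. The struct names, field names, and the DEMO_ASSERT_4BYTE_ALIGNED macro are hypothetical stand-ins, not the real commit-record definitions; inside the PostgreSQL tree one would presumably build this on the existing StaticAssertDecl macro rather than C11 static_assert.

```c
#include <assert.h>		/* static_assert (C11) */
#include <stdalign.h>	/* alignof (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for structs appended to a commit record. */
typedef struct demo_xinfo
{
	uint32_t	xinfo;
} demo_xinfo;

typedef struct demo_dbinfo
{
	uint32_t	db_id;
	uint32_t	ts_id;
} demo_dbinfo;

/*
 * Fail the build if a struct's alignment is anything other than 4, or if
 * its size is not a multiple of 4.  A struct that gained a 64-bit member
 * would typically trip the first check on 64-bit platforms.
 */
#define DEMO_ASSERT_4BYTE_ALIGNED(type) \
	static_assert(alignof(type) == 4, \
				  #type " must be exactly 4-byte aligned"); \
	static_assert(sizeof(type) % 4 == 0, \
				  #type " must be a multiple of 4 bytes long")

DEMO_ASSERT_4BYTE_ALIGNED(demo_xinfo);
DEMO_ASSERT_4BYTE_ALIGNED(demo_dbinfo);

/* Runtime mirror of the alignment check, convenient for unit tests. */
static int
demo_is_4byte_aligned(size_t alignment)
{
	return alignment == 4;
}
```

The point of checking at compile time is that a patch which widens one of these structs fails to build instead of silently producing misaligned reads in the record parser.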
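
The 16-to-20-byte growth mentioned in point 3 can be sketched with sizeof on two assumed layouts. Everything here is hypothetical: the field names, the 4-byte message header, and in particular the choice to store the wider relfilenode as two 32-bit halves, which is just one way to add 4 bytes without raising the struct's alignment above 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed locator with a 32-bit relfilenode: three 4-byte fields. */
typedef struct demo_locator32
{
	uint32_t	spc_oid;
	uint32_t	db_oid;
	uint32_t	rel_number;
} demo_locator32;

/*
 * Assumed locator with a wider relfilenode, split into two 32-bit halves
 * so that the struct keeps 4-byte alignment.  A single uint64_t field
 * would typically raise the alignment to 8 and add padding.
 */
typedef struct demo_locator56
{
	uint32_t	spc_oid;
	uint32_t	db_oid;
	uint32_t	rel_number_hi;
	uint32_t	rel_number_lo;
} demo_locator56;

/* Assumed message shape: a 4-byte header followed by the locator. */
typedef struct demo_msg32
{
	int8_t		id;
	int8_t		backend_hi;
	uint16_t	backend_lo;
	demo_locator32 rlocator;
} demo_msg32;

typedef struct demo_msg56
{
	int8_t		id;
	int8_t		backend_hi;
	uint16_t	backend_lo;
	demo_locator56 rlocator;
} demo_msg56;

/* Extra bytes per message under these assumed layouts. */
static size_t
demo_msg_growth(void)
{
	return sizeof(demo_msg56) - sizeof(demo_msg32);
}
```

Under these assumptions the narrow message is 16 bytes and the wide one is 20. The same 4 extra bytes per embedded locator is what point 2 is worried about for WAL block references.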