problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Robert Haas
Subject problems with making relfilenodes 56-bits
Msg-id CA+Tgmoaa9Yc9O-FP4vS_xTKf8Wgy8TzHpjnjN56_ShKE=jrP-Q@mail.gmail.com
List pgsql-hackers
OK, so the recent commit and revert of the 56-bit relfilenode patch
revealed a few issues that IMHO need design-level input. Let me try to
surface those here, starting a new thread to separate this discussion
from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord()
are dependent on every subsidiary structure that can be added to a
commit or abort record requiring exactly 4-byte alignment. IMHO, this
seems awfully fragile, even leaving the 56-bit patch aside. Prepare
records seem to have a much saner scheme: they've also got a bunch of
different things that can be stuck onto the main record, but they
maxalign each top-level thing that they stick in there. So
ParsePrepareRecord() doesn't have to make any icky alignment
assumptions the way ParseCommitRecord() and ParseAbortRecord() do.
Unfortunately, that scheme doesn't work as well for commit records,
because the very first top-level thing only needs 2 bytes. We're
currently using 4, and it would obviously be nicer to cut that down to
2 than to have it go up to 8. We could try to rejigger things somehow
to avoid needing that 2-byte quantity in there as a separate top-level
item, but I'm not quite sure how to do that. Alternatively, we could
just copy everything to ensure alignment, but that seems kind of
expensive.

If we don't decide to do either of those things, we should at least
better document, and preferably enforce via asserts, the requirement
that these structs be exactly 4-byte aligned, so that nobody else
makes the same mistake in the future.
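
As a rough illustration of the "enforce via asserts" idea, here is a
minimal sketch using C11 _Static_assert. The struct names, the macro
DEMO_ASSERT_4BYTE_ALIGNED, and the helper function are hypothetical
stand-ins invented for this example; they are not the actual PostgreSQL
definitions, and PostgreSQL has its own static-assertion macros that a
real patch would presumably use instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real PostgreSQL definitions. */
typedef struct DemoAlignLocator
{
    uint32_t spcOid;
    uint32_t dbOid;
    uint32_t relNumber;
} DemoAlignLocator;

typedef struct DemoXactRelLocators
{
    int32_t          nrels;
    DemoAlignLocator xlocators[1];
} DemoXactRelLocators;

/*
 * Fail the build unless the type has exactly 4-byte alignment and a size
 * that is a multiple of 4, so that consecutive pieces of a record stay
 * 4-byte aligned when they are laid out back to back.
 */
#define DEMO_ASSERT_4BYTE_ALIGNED(type) \
    _Static_assert(_Alignof(type) == 4 && sizeof(type) % 4 == 0, \
                   #type " must be exactly 4-byte aligned")

DEMO_ASSERT_4BYTE_ALIGNED(DemoAlignLocator);
DEMO_ASSERT_4BYTE_ALIGNED(DemoXactRelLocators);

/* Runtime counterpart: is a byte offset into a record 4-byte aligned? */
static int
demo_offset_is_4byte_aligned(size_t offset)
{
    return offset % 4 == 0;
}
```

On most platforms, adding a uint64_t member to one of these structs
would raise its alignment to 8 and trip the assertion at compile time,
which is the kind of mistake the check is meant to catch.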

2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.
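
For readers unfamiliar with the "variable-width representations" idea,
the following is a generic LEB128-style sketch: each byte carries 7 bits
of the value, and the high bit says whether more bytes follow. This is
only an illustration of the general technique, not a proposed WAL
format, and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode value in 7-bit groups, least significant group first. The high
 * bit of each byte is set when more bytes follow. out must have room for
 * 10 bytes. Returns the number of bytes written (1..10).
 */
static size_t
demo_varint_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;

    while (value >= 0x80)
    {
        out[n++] = (uint8_t) (value | 0x80);
        value >>= 7;
    }
    out[n++] = (uint8_t) value;
    return n;
}

/*
 * Decode a value written by demo_varint_encode. Returns the number of
 * bytes consumed, or 0 if the input is truncated or too long.
 */
static size_t
demo_varint_decode(const uint8_t *in, size_t len, uint64_t *value)
{
    uint64_t result = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len && shift < 64; i++, shift += 7)
    {
        result |= (uint64_t) (in[i] & 0x7F) << shift;
        if ((in[i] & 0x80) == 0)
        {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

With this encoding, values below 128 take one byte, while the largest
56-bit value takes eight, so small numbers get cheaper at the cost of
extra decoding work on every record.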

3. Sinval Message Size. Sinval messages are 16 bytes right now.
They'll have to grow to 20 bytes if we do this. There's even less room
for bit-squeezing here than there is for the WAL stuff. I'm skeptical
that this really matters, but Tom seems concerned.
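
The 16-to-20-byte arithmetic can be sketched with a pair of toy
structs. These layouts are assumptions made for illustration (a 4-byte
header followed by a locator made of 32-bit fields); they are not
copied from the PostgreSQL headers, and the wider relfilenode is
modeled here as two 32-bit halves purely so that the struct keeps
4-byte alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical locator with a 32-bit relfilenode: 12 bytes. */
typedef struct DemoLocator32
{
    uint32_t spcOid;
    uint32_t dbOid;
    uint32_t relNumber;
} DemoLocator32;

/* Hypothetical locator with a 56-bit relfilenode split in two: 16 bytes. */
typedef struct DemoLocator56
{
    uint32_t spcOid;
    uint32_t dbOid;
    uint32_t relNumberHi;   /* upper 24 bits of the 56-bit value */
    uint32_t relNumberLo;   /* lower 32 bits */
} DemoLocator56;

/* Toy message: 4 bytes of header plus the locator. */
typedef struct DemoSmgrMsg32
{
    int8_t        id;
    int8_t        backend_hi;
    uint16_t      backend_lo;
    DemoLocator32 rlocator;
} DemoSmgrMsg32;

typedef struct DemoSmgrMsg56
{
    int8_t        id;
    int8_t        backend_hi;
    uint16_t      backend_lo;
    DemoLocator56 rlocator;
} DemoSmgrMsg56;

_Static_assert(sizeof(DemoSmgrMsg32) == 16, "toy 32-bit message is 16 bytes");
_Static_assert(sizeof(DemoSmgrMsg56) == 20, "toy 56-bit message is 20 bytes");

/* Reassemble the 56-bit value from its two halves. */
static uint64_t
demo_join_relnumber(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}
```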

4. Other Uses of RelFileLocator. There are a bunch of structs I
haven't looked into yet that also embed RelFileLocator, which may have
their own issues with alignment, padding, and/or size: ginxlogSplit,
ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse,
xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData,
xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a
bunch of these are things that get written into WAL, but at least some
of them seem like they probably don't get written into WAL enough to
matter. Needs more investigation, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


