Thread: problems with making relfilenodes 56-bits
OK, so the recent commit and revert of the 56-bit relfilenode patch revealed a few issues that IMHO need design-level input. Let me try to surface those here, starting a new thread to separate this discussion from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord() are dependent on every subsidiary structure that can be added to a commit or abort record requiring exactly 4-byte alignment. IMHO, this seems awfully fragile, even leaving the 56-bit patch aside. Prepare records seem to have a much saner scheme: they've also got a bunch of different things that can be stuck onto the main record, but they maxalign each top-level thing that they stick in there. So ParsePrepareRecord() doesn't have to make any icky alignment assumptions the way ParseCommitRecord() and ParseAbortRecord() do. Unfortunately, that scheme doesn't work as well for commit records, because the very first top-level thing only needs 2 bytes. We're currently using 4, and it would obviously be nicer to cut that down to 2 than to have it go up to 8. We could try to rejigger things around somehow to avoid needing that 2-byte quantity in there as a separate top-level item, but I'm not quite sure how to do that, or we could just copy everything to ensure alignment, but that seems kind of expensive. If we don't decide to do either of those things, we should at least better document, and preferably enforce via asserts (see the sketch after this message), the requirement that these structs be exactly 4-byte aligned, so that nobody else makes the same mistake in the future.

2. WAL Size. Block references in the WAL are by RelFileLocator, so if you make RelFileLocators bigger, WAL gets bigger. We'd have to test the exact impact of this, but it seems a bit scary: if you have a WAL stream with few FPIs doing DML on a narrow table, probably most records will contain 1 block reference (and occasionally more, but I guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that block reference feels like it might add up to something significant. I don't really see any way around this, either: if you make relfilenode values wider, they take up more space. Perhaps there's a way to claw that back elsewhere, or we could do something really crazy like switch to variable-width representations of integer quantities in WAL records, but there doesn't seem to be any simple way forward other than, you know, deciding that we're willing to pay the cost of the additional WAL volume.

3. Sinval Message Size. Sinval messages are 16 bytes right now. They'll have to grow to 20 bytes if we do this. There's even less room for bit-squeezing here than there is for the WAL stuff. I'm skeptical that this really matters, but Tom seems concerned.

4. Other Uses of RelFileLocator. There are a bunch of structs I haven't looked into yet that also embed RelFileLocator, which may have their own issues with alignment, padding, and/or size: ginxlogSplit, ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse, xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData, xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a bunch of these are things that get written into WAL, but at least some of them seem like they probably don't get written into WAL enough to matter. Needs more investigation, though.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
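A minimal sketch of the kind of compile-time enforcement point 1 asks for. The struct names are commit/abort-record subsidiary structs assumed from src/include/access/xact.h; the real list would need to be checked against what ParseCommitRecord() and ParseAbortRecord() actually walk, and the structs' member alignment requirement (no 8-byte members) would still have to be verified separately:

#include "postgres.h"
#include "access/xact.h"

/*
 * Each subsidiary chunk appended to a commit/abort record must be a
 * multiple of 4 bytes, or the next chunk ends up misaligned when the
 * record is parsed in place.
 */
#define AssertCommitChunkAligned(type) \
	StaticAssertDecl(sizeof(type) % sizeof(uint32) == 0, \
					 #type " must be a multiple of 4 bytes")

AssertCommitChunkAligned(xl_xact_subxacts);
AssertCommitChunkAligned(xl_xact_invals);
/* ...and likewise for the other subsidiary structs. */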
Robert Haas <robertmhaas@gmail.com> writes: > 3. Sinval Message Size. Sinval messages are 16 bytes right now. > They'll have to grow to 20 bytes if we do this. There's even less room > for bit-squeezing here than there is for the WAL stuff. I'm skeptical > that this really matters, but Tom seems concerned. As far as that goes, I'm entirely prepared to accept a conclusion that the benefits of widening relfilenodes justify whatever space or speed penalties may exist there. However, we cannot honestly make that conclusion if we haven't measured said penalties. The same goes for the other issues you raise here. regards, tom lane
On Wed, Sep 28, 2022 at 4:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > As far as that goes, I'm entirely prepared to accept a conclusion > that the benefits of widening relfilenodes justify whatever space > or speed penalties may exist there. However, we cannot honestly > make that conclusion if we haven't measured said penalties. > The same goes for the other issues you raise here. I generally agree, but the devil is in the details. I tend to agree with Robert that many individual WAL record types just don't appear frequently enough to matter (it also helps that even the per-record space overhead with wider 56-bit relfilenodes isn't so bad). Just offhand I'd say that ginxlogSplit, ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse and xl_btree_reuse_page are likely to be in this category (though would be nice to see some numbers for those). I'm much less sure about the other record types. Any WAL records with a variable number of relfilenode entries seem like they might be more of a problem. But I'm not ready to accept that that cannot be ameliorated in some way. Just for example, it wouldn't be impossible to do some kind of varbyte encoding for some record types. How many times will the cluster actually need billions of relfilenodes? It has to work, but maybe it can be suboptimal from a space overhead perspective. I'm not saying that we need to do anything fancy just yet. I'm just saying that there definitely *are* options. Maybe it's not really necessary to come up with something like a varbyte encoding, and maybe the complexity it imposes just won't be worth it -- I really have no opinion on that just yet. -- Peter Geoghegan
On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmhaas@gmail.com> wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and the interest of Neon in the area, and I think we can reduce the size of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B), TransactionId (4B), previous record pointer (8B), flags (1B), redo manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these fields, the TransactionId could reasonably be omitted for certain WAL records (as an example: index insertions don't really need the XID). Additionally, the length field could be made to be variable length, and any padding is just plain bad (adding 4 bytes to all insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record header of which the only required fields would be prevptr (8B), CRC (4B), rmgr (1B) and flags (1B), for a minimal size of 14 bytes (a rough struct sketch follows below this message). I don't yet know the performance of this, but considering that there will be a lot more conditionals in header decoding it might be slower for any one backend, but faster overall (less overall IOps).

The flags field would be indications for additional information [flag name (bits): explanation (additional xlog header data in bytes)]:

- len_size (0..1): xlog record size is at most xlrec_header_only (0B), uint8_max (1B), uint16_max (2B), uint32_max (4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B) (rationale for logging CID in [0]; now in the record header because XID is included there as well, and both are required for consistent snapshots)
- has_rminfo (4): has non-zero redo-manager flags field (1B) (rationale for a separate field in [1]; non-zero allows a 1B space optimization for one of each rmgr's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record compression, or other purposes.

A normal record header (an XLOG record with at least some registered data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and one with an XID only up to 21 bytes. So, when compared to the current XLogRecord format, we would in general recover 2 or 3 bytes from the xl_tot_len field, 1 or 2 bytes from the alignment hole, and potentially the 4 bytes of the xid when that data is considered useless during recovery, or physical or logical replication.
Kind regards, Matthias van de Meent [0] https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com [1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de
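Purely to make the layout described above concrete, the bare-bones header could look roughly like the following. This is a sketch, not the actual prototype patch: the type name and flag macros are invented, and the field widths simply follow the sizes listed above. Note that the C compiler would pad such a struct to 16 bytes, so a real implementation would presumably read and write the fields individually rather than via sizeof().

#include "postgres.h"
#include "access/rmgr.h"		/* RmgrId */
#include "access/xlogdefs.h"	/* XLogRecPtr */
#include "port/pg_crc32c.h"		/* pg_crc32c */

/* Hypothetical flag bits, mirroring the list in the message above. */
#define XLR_MIN_LENSIZE_MASK		0x03	/* 0/1/2/4-byte total-length field */
#define XLR_MIN_HAS_XID				0x04	/* 4B TransactionId follows */
#define XLR_MIN_HAS_CID				0x08	/* 4B CommandId follows */
#define XLR_MIN_HAS_RMINFO			0x10	/* 1B rmgr-specific info follows */
#define XLR_MIN_SPECIAL_REL			0x20	/* pre-existing definition */
#define XLR_MIN_CHECK_CONSISTENCY	0x40	/* pre-existing definition */

/* The 14 bytes of required on-disk header; optional fields follow per xl_flags. */
typedef struct XLogRecordMinimal
{
	XLogRecPtr	xl_prev;		/* 8B: start of previous record in WAL */
	pg_crc32c	xl_crc;			/* 4B: CRC covering the whole record */
	RmgrId		xl_rmid;		/* 1B: resource manager for this record */
	uint8		xl_flags;		/* 1B: which optional fields follow */
	/* then: 0-4B length, optional xid, cid and rminfo, per xl_flags */
} XLogRecordMinimal;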
On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Currently, our minimal WAL record is exactly 24 bytes: length (4B), > TransactionId (4B), previous record pointer (8B), flags (1B), redo > manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these > fields, TransactionID could reasonably be omitted for certain WAL > records (as example: index insertions don't really need the XID). > Additionally, the length field could be made to be variable length, > and any padding is just plain bad (adding 4 bytes to all > insert/update/delete/lock records was frowned upon). Right. I was shocked when I realized that we had two bytes of padding in there, considering that numerous rmgrs are stealing bits from the 1-byte field that identifies the record type. My question was: why aren't we exposing those 2 bytes for rmgr-type-specific use? Or for something like xl_xact_commit, we could get rid of xl_xact_info if we had those 2 bytes to work with. Right now, I see that a bare commit record is 34 bytes which rounds out to 40. With the trick above, we could shave off 4 bytes bringing the size to 30 which would round to 32. That's a pretty significant savings, although it'd be a lot better if we could get some kind of savings for DML records which could be much higher frequency. > I'm working on a prototype patch for a more bare-bones WAL record > header of which the only required fields would be prevptr (8B), CRC > (4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't > yet know the performance of this, but the considering that there will > be a lot more conditionals in header decoding it might be slower for > any one backend, but faster overall (less overall IOps) > > The flags field would be indications for additional information: [flag > name (bits): explanation (additional xlog header data in bytes)] > - len_size(0..1): xlog record size is at most xlrec_header_only (0B), > uint8_max(1B), uint16_max(2B), uint32_max(4B) > - has_xid (2): contains transaction ID of logging transaction (4B, or > probably 8B when we introduce 64-bit xids) > - has_cid (3): contains the command ID of the logging statement (4B) > (rationale for logging CID in [0], now in record header because XID is > included there as well, and both are required for consistent > snapshots. > - has_rminfo (4): has non-zero redo-manager flags field (1B) > (rationale for separate field [1], non-zero allows 1B space > optimization for one of each RMGR's operations) > - special_rel (5): pre-existing definition > - check_consistency (6): pre-existing definition > - unset (7): no meaning defined yet. Could be used for full record > compression, or other purposes. Interesting. One fly in the ointment here is that WAL records start on 8-byte boundaries (probably MAXALIGN boundaries, but I didn't check the details). And after the 24-byte header, there's a 2-byte header (or 5-byte header) introducing the payload data (see XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload data is a multiple of 8, and is short enough that we use the short data header, we waste 6 bytes. If the data length is a multiple of 4, we waste 2 bytes. And those are probably really common cases. So the big improvements probably come from saving 2 bytes or 6 bytes or 10 bytes, and saving say 3 or 5 is probably not much better than 2. Or at least that's what I'm guessing. -- Robert Haas EDB: http://www.enterprisedb.com
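To make the rounding arithmetic concrete, here is a throwaway program reproducing the numbers above. The 24-byte XLogRecord, the 2-byte short data header and MAXALIGN of 8 are hard-coded assumptions here; the real definitions live in access/xlogrecord.h and c.h.

#include <stdio.h>

#define MAXALIGN(len)	(((len) + 7) & ~(size_t) 7)

int
main(void)
{
	const size_t header = 24 + 2;	/* XLogRecord + short data header */

	for (size_t payload = 1; payload <= 16; payload++)
	{
		size_t		total = header + payload;
		size_t		stored = MAXALIGN(total);

		/* e.g. payload 8 -> record 34, stored 40, 6 bytes of padding */
		printf("payload %2zu: record %2zu, stored %2zu, padding %zu\n",
			   payload, total, stored, stored - total);
	}
	return 0;
}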
On Thu, Sep 29, 2022 at 2:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary

I have done some testing around this area to see the impact on WAL size, especially when the WAL records are small. With a very simple test with insert/update/delete I can see around an 11% increase in WAL size [1]. Then I did some more tests with pgbench at a smaller scale factor (1); there I do not see a significant increase in the WAL size, although it does increase by around 1-2% [2].

[1]
checkpoint;
do $$
declare
  lsn1 pg_lsn;
  lsn2 pg_lsn;
  diff float;
begin
  select pg_current_wal_lsn() into lsn1;
  CREATE TABLE test(a int);
  for counter in 1..1000 loop
    INSERT INTO test values(1);
    UPDATE test set a=a+1;
    DELETE FROM test where a=1;
  end loop;
  DROP TABLE test;
  select pg_current_wal_lsn() into lsn2;
  select pg_wal_lsn_diff(lsn2, lsn1) into diff;
  raise notice '%', diff/1024;
end;
$$;

wal generated head:  66199.09375 kB
wal generated patch: 73906.984375 kB
wal-size increase:   11%

[2]
./pgbench -i postgres
./pgbench -c1 -j1 -t 30000 -M prepared postgres

wal generated head:  30780 kB
wal generated patch: 31284 kB
wal-size increase:   ~1-2%

I have done further analysis to understand why the WAL size increases by only 1-2% on the pgbench workload. With waldump I could see that the WAL size per transaction increased from 566 bytes (on head) to 590 bytes (with the patch), which is around 4%; but when we look at the total WAL size difference after 30k transactions it is just 1-2%, and I think that is because there are other records which are not impacted, like FPIs.

Conclusion: as suspected, with a very targeted test case producing small WAL records we can see a significant 11% increase in WAL size, but with a pgbench-like workload the increase in WAL size is much smaller.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote: > I have done some testing around this area to see the impact on WAL > size especially when WAL sizes are smaller, with a very simple test > with insert/update/delete I can see around an 11% increase in WAL size > [1] then I did some more test with pgbench with smaller scale > factor(1) there I do not see a significant increase in the WAL size > although it increases WAL size around 1-2%. [2]. I think it'd be interesting to look at per-record-type stats between two equivalent workload, to see where practical workloads suffer the most (possibly with fpw=off, to make things more repeatable). I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to pay off for 56bit relfilenodes. The class of problems foreclosed is large enough to "waste" "improvement potential" on this. Greetings, Andres Freund
On Fri, Sep 30, 2022 at 5:20 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to > pay off for 56bit relfilenodes. The class of problems foreclosed is large > enough to "waste" "improvement potential" on this. I agree overall. A closely related but distinct question occurs to me: if we're going to be "wasting" space on alignment padding in certain cases one way or another, can we at least recognize those cases and take advantage at the level of individual WAL record formats? In other words: So far we've been discussing the importance of not going over a critical threshold for certain WAL records. But it might also be valuable to consider recognizing that that's inevitable, and that we might as well make the most of it by including one or two other things. This seems most likely to matter when managing the problem of negative compression with per-WAL-record compression schemes for things like arrays of page offset numbers [1]. If (say) a given compression scheme "wastes" space for arrays of only 1-3 items, but we already know that the relevant space will all be lost to alignment needed by code one level down in any case, does it really count as waste? We're likely always going to have some kind of negative compression, but you do get to influence where and when the negative compression happens. Not sure how relevant this will turn out to be, but seems worth considering. More generally, thinking about how things work across multiple layers of abstraction seems like it could be valuable in other ways. [1] https://postgr.es/m/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=PmRebc4dAxjy6g@mail.gmail.com -- Peter Geoghegan
On Sat, Oct 1, 2022 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
> > I have done some testing around this area to see the impact on WAL
> > size especially when WAL sizes are smaller, with a very simple test
> > with insert/update/delete I can see around an 11% increase in WAL size
> > [1] then I did some more test with pgbench with smaller scale
> > factor(1) there I do not see a significant increase in the WAL size
> > although it increases WAL size around 1-2%. [2].
>
> I think it'd be interesting to look at per-record-type stats between two
> equivalent workload, to see where practical workloads suffer the most
> (possibly with fpw=off, to make things more repeatable).

While testing pgbench, I dumped the WAL sizes using waldump. In the pgbench case, most of the record sizes increased by 4 bytes, as they include a single block reference, and the same is true for the other test case I sent. Here is the WAL dump of what the sizes look like for a single pgbench transaction [1]. Maybe for seeing these changes with different workloads we can run some of the files from the regression tests and compare the individual WAL record sizes.

Head:

rmgr: Heap        len (rec/tot):  54/  54, tx: 867, lsn: 0/02DD1280, prev 0/02DD1250, desc: LOCK off 44: xid 867: flags 0x01 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/16424 blk 226
rmgr: Heap        len (rec/tot): 171/ 171, tx: 867, lsn: 0/02DD12B8, prev 0/02DD1280, desc: UPDATE off 44 xmax 867 flags 0x11 ; new off 30 xmax 0, blkref #0: rel 1663/5/16424 blk 1639, blkref #1: rel 1663/5/16424 blk 226
rmgr: Btree       len (rec/tot):  64/  64, tx: 867, lsn: 0/02DD1368, prev 0/02DD12B8, desc: INSERT_LEAF off 290, blkref #0: rel 1663/5/16432 blk 39
rmgr: Heap        len (rec/tot):  78/  78, tx: 867, lsn: 0/02DD13A8, prev 0/02DD1368, desc: HOT_UPDATE off 15 xmax 867 flags 0x10 ; new off 19 xmax 0, blkref #0: rel 1663/5/16427 blk 0
rmgr: Heap        len (rec/tot):  74/  74, tx: 867, lsn: 0/02DD13F8, prev 0/02DD13A8, desc: HOT_UPDATE off 9 xmax 867 flags 0x10 ; new off 10 xmax 0, blkref #0: rel 1663/5/16425 blk 0
rmgr: Heap        len (rec/tot):  79/  79, tx: 867, lsn: 0/02DD1448, prev 0/02DD13F8, desc: INSERT off 9 flags 0x08, blkref #0: rel 1663/5/16434 blk 0
rmgr: Transaction len (rec/tot):  46/  46, tx: 867, lsn: 0/02DD1498, prev 0/02DD1448, desc: COMMIT 2022-10-01 11:24:03.464437 IST

Patch:

rmgr: Heap        len (rec/tot):  58/  58, tx: 818, lsn: 0/0218BEB0, prev 0/0218BE80, desc: LOCK off 34: xid 818: flags 0x01 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/100004 blk 522
rmgr: Heap        len (rec/tot): 175/ 175, tx: 818, lsn: 0/0218BEF0, prev 0/0218BEB0, desc: UPDATE off 34 xmax 818 flags 0x11 ; new off 8 xmax 0, blkref #0: rel 1663/5/100004 blk 1645, blkref #1: rel 1663/5/100004 blk 522
rmgr: Btree       len (rec/tot):  68/  68, tx: 818, lsn: 0/0218BFA0, prev 0/0218BEF0, desc: INSERT_LEAF off 36, blkref #0: rel 1663/5/100010 blk 89
rmgr: Heap        len (rec/tot):  82/  82, tx: 818, lsn: 0/0218BFE8, prev 0/0218BFA0, desc: HOT_UPDATE off 66 xmax 818 flags 0x10 ; new off 90 xmax 0, blkref #0: rel 1663/5/100007 blk 0
rmgr: Heap        len (rec/tot):  78/  78, tx: 818, lsn: 0/0218C058, prev 0/0218BFE8, desc: HOT_UPDATE off 80 xmax 818 flags 0x10 ; new off 81 xmax 0, blkref #0: rel 1663/5/100005 blk 0
rmgr: Heap        len (rec/tot):  83/  83, tx: 818, lsn: 0/0218C0A8, prev 0/0218C058, desc: INSERT off 80 flags 0x08, blkref #0: rel 1663/5/100011 blk 0
rmgr: Transaction len (rec/tot):  46/  46, tx: 818, lsn: 0/0218C100, prev 0/0218C0A8, desc: COMMIT 2022-10-01 11:11:03.564063 IST

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be interesting to look at per-record-type stats between two > equivalent workload, to see where practical workloads suffer the most > (possibly with fpw=off, to make things more repeatable). I would expect, and Dilip's results seem to confirm, the effect to be pretty uniform: basically, nearly every record gets bigger by 4 bytes. That's because most records contain at least one block reference, and if they contain multiple block references, likely all but one will be marked BKPBLOCK_SAME_REL, so we pay the cost just once. Because of alignment padding, the practical effect is probably that about half of the records get bigger by 8 bytes and the other half don't get bigger at all. But I see no reason to believe that things are any better or worse than that. Most interesting record types are going to contain some kind of variable-length payload, so the chances that a 4 byte size increase pushes you across a MAXALIGN boundary seem to be no better or worse than fifty-fifty. > I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to > pay off for 56bit relfilenodes. The class of problems foreclosed is large > enough to "waste" "improvement potential" on this. I thought about trying to buy back some space elsewhere, and I think that would be a reasonable approach to getting this committed if we could find a way to do it. However, I don't see a terribly obvious way of making it happen. Trying to do it by optimizing specific WAL record types seems like a real pain in the neck, because there's tons of different WAL records that all have the same problem. Trying to do it in a generic way makes more sense, and the fact that we have 2 padding bytes available in XLogRecord seems like a place to start looking, but the way forward from there is not clear to me. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workload, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit bigger. Whereas it does matter if it's a small record. Focussing on optimizing the record types where the increase is large seems like a potential way forward to me, even if we can't find something generic.

> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.

I think there's plenty potential...

> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off at the bottom. Certainly our regression tests aren't generally representative. But I think it still decently highlights how just improving a few records could buy you back more than enough.

> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, but only store a 2 byte offset from the page header xl_prev within each record.

Greetings,

Andres Freund

by total size:
Type                       N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                       -      ---  -----------      ---  --------      ---  -------------      ---
Heap/INSERT          1041666 ( 50.48)    106565255 ( 50.54)         0 (  0.00)      106565255 ( 43.92)
Btree/INSERT_LEAF     352196 ( 17.07)     24067672 ( 11.41)         0 (  0.00)       24067672 (  9.92)
Heap/DELETE           250852 ( 12.16)     13546008 (  6.42)         0 (  0.00)       13546008 (  5.58)
Hash/INSERT           108499 (  5.26)      7811928 (  3.70)         0 (  0.00)        7811928 (  3.22)
Transaction/COMMIT     16053 (  0.78)      6402657 (  3.04)         0 (  0.00)        6402657 (  2.64)
Gist/PAGE_UPDATE       57225 (  2.77)      5217100 (  2.47)         0 (  0.00)        5217100 (  2.15)
Gin/UPDATE_META_PAGE   23943 (  1.16)      4539970 (  2.15)         0 (  0.00)        4539970 (  1.87)
Gin/INSERT             27004 (  1.31)      3623998 (  1.72)         0 (  0.00)        3623998 (  1.49)
Gist/PAGE_SPLIT          448 (  0.02)      3391244 (  1.61)         0 (  0.00)        3391244 (  1.40)
SPGist/ADD_LEAF        38968 (  1.89)      3341696 (  1.58)         0 (  0.00)        3341696 (  1.38)
...
XLOG/FPI                7228 (  0.35)       378924 (  0.18)  29788166 ( 93.67)       30167090 ( 12.43)
...
Gin/SPLIT                141 (  0.01)        13011 (  0.01)   1187588 (  3.73)        1200599 (  0.49)
...
                    --------            --------              --------              --------
Total                2063609           210848282 [86.89%]     31802766 [13.11%]     242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)

sorted by number of records:
Heap/INSERT          1041666 ( 50.48)    106565255 ( 50.54)         0 (  0.00)      106565255 ( 43.92)
Btree/INSERT_LEAF     352196 ( 17.07)     24067672 ( 11.41)         0 (  0.00)       24067672 (  9.92)
Heap/DELETE           250852 ( 12.16)     13546008 (  6.42)         0 (  0.00)       13546008 (  5.58)
Hash/INSERT           108499 (  5.26)      7811928 (  3.70)         0 (  0.00)        7811928 (  3.22)
Gist/PAGE_UPDATE       57225 (  2.77)      5217100 (  2.47)         0 (  0.00)        5217100 (  2.15)
SPGist/ADD_LEAF        38968 (  1.89)      3341696 (  1.58)         0 (  0.00)        3341696 (  1.38)
Gin/INSERT             27004 (  1.31)      3623998 (  1.72)         0 (  0.00)        3623998 (  1.49)
Gin/UPDATE_META_PAGE   23943 (  1.16)      4539970 (  2.15)         0 (  0.00)        4539970 (  1.87)
Standby/LOCK           18451 (  0.89)       775026 (  0.37)         0 (  0.00)         775026 (  0.32)
Transaction/COMMIT     16053 (  0.78)      6402657 (  3.04)         0 (  0.00)        6402657 (  2.64)
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-03 08:12:39 -0400, Robert Haas wrote: > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > > > I think it'd be interesting to look at per-record-type stats between two > > > equivalent workload, to see where practical workloads suffer the most > > > (possibly with fpw=off, to make things more repeatable). > > > > I would expect, and Dilip's results seem to confirm, the effect to be > > pretty uniform: basically, nearly every record gets bigger by 4 bytes. > > That's because most records contain at least one block reference, and > > if they contain multiple block references, likely all but one will be > > marked BKPBLOCK_SAME_REL, so we pay the cost just once. > > But it doesn't really matter that much if an already large record gets a bit > bigger. Whereas it does matter if it's a small record. Focussing on optimizing > the record types where the increase is large seems like a potential way > forward to me, even if we can't find something generic. > > > > I thought about trying to buy back some space elsewhere, and I think > > that would be a reasonable approach to getting this committed if we > > could find a way to do it. However, I don't see a terribly obvious way > > of making it happen. > > I think there's plenty potential... > > > > Trying to do it by optimizing specific WAL record > > types seems like a real pain in the neck, because there's tons of > > different WAL records that all have the same problem. > > I am not so sure about that. Improving a bunch of the most frequent small > records might buy you back enough on just about every workload to be OK. > > I put the top record sizes for an installcheck run with full_page_writes off > at the bottom. Certainly our regression tests aren't generally > representative. But I think it still decently highlights how just improving a > few records could buy you back more than enough. > > > > Trying to do it in a generic way makes more sense, and the fact that we have > > 2 padding bytes available in XLogRecord seems like a place to start looking, > > but the way forward from there is not clear to me. > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > only store a 2 byte offset from the page header xl_prev within each record. With that small xl_prev we may not detect partial page writes in recycled segments; or other issues in the underlying file system. With small record sizes, the chance of returning incorrect data would be significant for small records (it would be approximately the chance of getting a record boundary on the underlying page boundary * chance of getting the same MAXALIGN-adjusted size record before the persistence boundary). That issue is part of the reason why my proposed change upthread still contains the full xl_prev. A different idea is removing most block_ids from the record, and optionally reducing per-block length fields to 1B. Used block ids are effectively always sequential, and we only allow 33+4 valid values, so we can use 2 bits to distinguish between 'block belonging to this ID field have at most 255B of data registered' and 'blocks up to this ID follow sequentially without own block ID'. That would save 2N-1 total bytes for N blocks. It is scraping the barrel, but I think it is quite possible. Lastly, we could add XLR_BLOCK_ID_DATA_MED for values >255 containing up to UINT16_MAX lengths. 
That would save 2 bytes for records that only just pass the 255B barrier, where 2B is still a fairly significant part of the record size. Kind regards, Matthias van de Meent
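For concreteness, here is a rough sketch of how the insertion side might pick between data-header sizes with the proposed medium variant. XLR_BLOCK_ID_DATA_SHORT and XLR_BLOCK_ID_DATA_LONG exist in access/xlogrecord.h today; XLR_BLOCK_ID_DATA_MED, its value and this helper are hypothetical:

#include "postgres.h"
#include "access/xlogrecord.h"

/* Hypothetical block ID for a 2-byte main-data length; not in core. */
#define XLR_BLOCK_ID_DATA_MED	251

static char *
emit_main_data_header(char *out, uint32 mainrdata_len)
{
	if (mainrdata_len <= UINT8_MAX)
	{
		*out++ = (char) XLR_BLOCK_ID_DATA_SHORT;	/* 2 bytes of header */
		*out++ = (char) mainrdata_len;
	}
	else if (mainrdata_len <= UINT16_MAX)
	{
		uint16		len16 = (uint16) mainrdata_len;

		*out++ = (char) XLR_BLOCK_ID_DATA_MED;		/* 3 bytes of header */
		memcpy(out, &len16, sizeof(len16));
		out += sizeof(len16);
	}
	else
	{
		*out++ = (char) XLR_BLOCK_ID_DATA_LONG;		/* 5 bytes of header */
		memcpy(out, &mainrdata_len, sizeof(mainrdata_len));
		out += sizeof(mainrdata_len);
	}
	return out;
}

Main data in the 256..65535 byte range would then take 3 header bytes instead of 5, matching the 2-byte saving estimated above.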
Hi, On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > only store a 2 byte offset from the page header xl_prev within each record. > > With that small xl_prev we may not detect partial page writes in > recycled segments; or other issues in the underlying file system. With > small record sizes, the chance of returning incorrect data would be > significant for small records (it would be approximately the chance of > getting a record boundary on the underlying page boundary * chance of > getting the same MAXALIGN-adjusted size record before the persistence > boundary). That issue is part of the reason why my proposed change > upthread still contains the full xl_prev. What exactly is the theory for this significant increase? I don't think xl_prev provides a meaningful protection against torn pages in the first place? Greetings, Andres Freund
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > > only store a 2 byte offset from the page header xl_prev within each record. > > > > With that small xl_prev we may not detect partial page writes in > > recycled segments; or other issues in the underlying file system. With > > small record sizes, the chance of returning incorrect data would be > > significant for small records (it would be approximately the chance of > > getting a record boundary on the underlying page boundary * chance of > > getting the same MAXALIGN-adjusted size record before the persistence > > boundary). That issue is part of the reason why my proposed change > > upthread still contains the full xl_prev. > > What exactly is the theory for this significant increase? I don't think > xl_prev provides a meaningful protection against torn pages in the first > place? XLog pages don't have checksums, so they do not provide torn page protection capabilities on their own. A singular xlog record is protected against torn page writes through the checksum that covers the whole record - if only part of the record was written, we can detect that through the mismatching checksum. However, if records end at the tear boundary, we must know for certain that any record that starts after the tear is the record that was written after the one before the tear. Page-local references/offsets would not work, because the record decoding doesn't know which xlog page the record should be located on; it could be both the version of the page before it was recycled, or the one after. Currently, we can detect this because the value of xl_prev will point to a record far in the past (i.e. not the expected value), but with a page-local version of xl_prev we would be less likely to detect torn pages (and thus be unable to handle this without risk of corruption) due to the significant chance of the truncated xl_prev value being the same in both the old and new record. Example: Page { [ record A ] | tear boundary | [ record B ] } gets recycled and receives a new record C at the place of A with the same length. With your proposal, record B would still be a valid record when it follows C; as the page-local serial number/offset reference to the previous record would still match after the torn write. With the current situation and a full LSN in xl_prev, the mismatching value in the xl_prev pointer allows us to detect this torn page write and halt replay, before redoing an old (incorrect) record. Kind regards, Matthias van de Meent PS. there are ideas floating around (I heard about this one from Heikki) where we could concatenate WAL records into one combined record that has only one shared xl_prev+crc; which would save these 12 bytes per record. However, that needs a lot of careful consideration to make sure that the persistence guarantee of operations doesn't get lost somewhere in the traffic.
Hi, On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote: > On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote: > > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote: > > > On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote: > > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, but > > > > only store a 2 byte offset from the page header xl_prev within each record. > > > > > > With that small xl_prev we may not detect partial page writes in > > > recycled segments; or other issues in the underlying file system. With > > > small record sizes, the chance of returning incorrect data would be > > > significant for small records (it would be approximately the chance of > > > getting a record boundary on the underlying page boundary * chance of > > > getting the same MAXALIGN-adjusted size record before the persistence > > > boundary). That issue is part of the reason why my proposed change > > > upthread still contains the full xl_prev. > > > > What exactly is the theory for this significant increase? I don't think > > xl_prev provides a meaningful protection against torn pages in the first > > place? > > XLog pages don't have checksums, so they do not provide torn page > protection capabilities on their own. > A singular xlog record is protected against torn page writes through > the checksum that covers the whole record - if only part of the record > was written, we can detect that through the mismatching checksum. > However, if records end at the tear boundary, we must know for certain > that any record that starts after the tear is the record that was > written after the one before the tear. Page-local references/offsets > would not work, because the record decoding doesn't know which xlog > page the record should be located on; it could be both the version of > the page before it was recycled, or the one after. > Currently, we can detect this because the value of xl_prev will point > to a record far in the past (i.e. not the expected value), but with a > page-local version of xl_prev we would be less likely to detect torn > pages (and thus be unable to handle this without risk of corruption) > due to the significant chance of the truncated xl_prev value being the > same in both the old and new record. Think this is addressable, see below. > Example: Page { [ record A ] | tear boundary | [ record B ] } gets > recycled and receives a new record C at the place of A with the same > length. > > With your proposal, record B would still be a valid record when it > follows C; as the page-local serial number/offset reference to the > previous record would still match after the torn write. > With the current situation and a full LSN in xl_prev, the mismatching > value in the xl_prev pointer allows us to detect this torn page write > and halt replay, before redoing an old (incorrect) record. In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection? As you specified it, C has the same length as A, so B's xl_prev will be the same whether it's a page local offset or the full 8 bytes. The relevant protection against issues like this isn't xl_prev, it's the CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather than the offset. > PS. there are ideas floating around (I heard about this one from > Heikki) where we could concatenate WAL records into one combined > record that has only one shared xl_prev+crc; which would save these 12 > bytes per record. 
However, that needs a lot of careful consideration > to make sure that the persistence guarantee of operations doesn't get > lost somewhere in the traffic. One version of that is to move the CRCs to the page header, make the pages smaller (512 bytes / 4K, depending on the hardware), and to pad out partial pages when flushing them out. Rewriting pages is bad for hardware and prevents having multiple WAL IOs in flight at the same time. Greetings, Andres Freund
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C; as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I understand it. So in the scenario that Matthias poses, let's say the segment was previously 000000010000000400000025 and now it's 000000010000000400000049. So if a given chunk of the page is leftover from when the page was 000000010000000400000025, it will have xl_prev values like 4/25xxxxxx. If it's been rewritten since the segment was recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell whether record B has been overwritten with a new record since the segment was recycled. But if we stored only 2 bytes in each xl_prev field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real hazard, and not some weird corner case but actually something that will happen regularly. All you need is for the old segment that got recycled to have a record starting at the same place where the page tore, and for the previous record to have been the same length as the one on the new page. Given that there's only <~1024 places on a page where a record can start, and given that in many workloads the lengths of WAL records will be fairly uniform, this doesn't seem unlikely at all.

A way to up the chances of detecting this case would be to store only 2 or 4 bytes of xl_prev on disk, but arrange to include the full xl_prev value in the xl_crc calculation. Then your chances of a collision are about 2^-32, or maybe more if you posit that CRC is a weak and crappy algorithm, but even then it's strictly better than just hoping that there isn't a tear point at a record boundary where the same length record precedes the tear in both the old and new WAL segments. However, on the flip side, even if you assume that CRC is a fantastic algorithm with beautiful and state-of-the-art bit mixing, the chances of it failing to notice the problem are still >0, whereas the current algorithm that compares the full xl_prev value is a sure thing. Because xl_prev values are never repeated, it's certain that when a segment is recycled, any values that were legal for the old one aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
> On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
> > > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > > recycled and receives a new record C at the place of A with the same
> > > length.
> > >
> > > With your proposal, record B would still be a valid record when it
> > > follows C; as the page-local serial number/offset reference to the
> > > previous record would still match after the torn write.
> > > With the current situation and a full LSN in xl_prev, the mismatching
> > > value in the xl_prev pointer allows us to detect this torn page write
> > > and halt replay, before redoing an old (incorrect) record.
> >
> > In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> > As you specified it, C has the same length as A, so B's xl_prev will be the
> > same whether it's a page local offset or the full 8 bytes.
> >
> > The relevant protection against issues like this isn't xl_prev, it's the
> > CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> > than the offset.
>
> I'm really confused. xl_prev *is* a full-width LSN currently, as I
> understand it. So in the scenario that Matthias poses, let's say the
> segment was previously 000000010000000400000025 and now it's
> 000000010000000400000049. So if a given chunk of the page is leftover
> from when the page was 000000010000000400000025, it will have xl_prev
> values like 4/25xxxxxx. If it's been rewritten since the segment was
> recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
> whether record B has been overwritten with a new record since the
> segment was recycled. But if we stored only 2 bytes in each xl_prev
> field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we write out a bunch of pages, crash, only some of the pages made it to disk, we then write new ones of the same length, and now find a record after the "real" end of the WAL to be valid. Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with parts of the page ending up with the new contents and parts of the page ending up with the old "pre recycling" content, correct? Because without a crash we'll have zeroed out the remainder of the page (well, leaving walreceiver out of the picture, grr). However, this can easily happen without any record boundaries on the partially recycled page, so we rely on the CRCs to protect against this.

Here I originally wrote a more in-depth explanation of the scenario I was thinking about, where we already rely on CRCs to protect us. But, ooph, I think they don't protect us reliably, with today's design. But maybe I'm missing more things today.

Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents), insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B' location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?
> A way to up the chances of detecting this case would be to store only > 2 or 4 bytes of xl_prev on disk, but arrange to include the full > xl_prev value in the xl_crc calculation. Right, that's what I was suggesting as well. > Then your chances of a collision are about 2^-32, or maybe more if you posit > that CRC is a weak and crappy algorithm, but even then it's strictly better > than just hoping that there isn't a tear point at a record boundary where > the same length record precedes the tear in both the old and new WAL > segments. However, on the flip side, even if you assume that CRC is a > fantastic algorithm with beautiful and state-of-the-art bit mixing, the > chances of it failing to notice the problem are still >0, whereas the > current algorithm that compares the full xl_prev value is a sure > thing. Because xl_prev values are never repeated, it's certain that when a > segment is recycled, any values that were legal for the old one aren't legal > in the new one. Given that we already rely on the CRCs to detect corruption within a single record spanning tear boundaries, this doesn't cause me a lot of heartburn. But I suspect we might need to do something about the scenario I outlined above, which likely would also increase the protection against this issue. I think there might be reasonable ways to increase the guarantees based on the 2 byte xl_prev approach "alone". We don't have to store the offset from the page header as a plain offset. What about storing something like: page_offset ^ (page_lsn >> wal_segsz_shift) I think something like that'd result in prev_not_really_lsn typically not simply matching after recycling. Of course it only provides so much protection, given 16bits... Greetings, Andres Freund
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote: > Consider the following sequence: > > 1) we write WAL like this: > > [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn] > > 2) crash, the sectors with A and C made it to disk, the one with B didn't > > 3) We replay A, discover B is invalid (possibly zeroes or old contents), > insert a new record B' with the same length. Now it looks like this: > > [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn] > > 4) crash, the sector with B' makes it to disk > > 5) we replay A, B', C, because C has an xl_prev that's compatible with B' > location and a valid CRC. > > Oops. > > I think this can happen both within a single page and across page boundaries. > > I hope I am missing something here? If you are, I don't know what it is off-hand. That seems like a plausible scenario to me. It does require the OS to write things out of order, and I don't know how likely that is in practice, but the answer probably isn't zero. > I think there might be reasonable ways to increase the guarantees based on the > 2 byte xl_prev approach "alone". We don't have to store the offset from the > page header as a plain offset. What about storing something like: > page_offset ^ (page_lsn >> wal_segsz_shift) > > I think something like that'd result in prev_not_really_lsn typically not > simply matching after recycling. Of course it only provides so much > protection, given 16bits... Maybe. That does seem somewhat better, but I feel like it's hard to reason about whether it's safe in absolute terms or just resistant to the precise scenario Matthias postulated while remaining vulnerable to slightly modified versions. How about this: remove xl_prev. widen xl_crc to 64 bits. include the CRC of the previous WAL record in the xl_crc calculation. That doesn't cut quite as many bytes out of the record size as your proposal, but it feels like it should strongly resist pretty much every attack of this general type, with only the minor disadvantage that the more expensive CRC calculation will destroy all hope of getting anything committed. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...

I lightly dusted off my old varint implementation from [1] and converted the RelFileLocator and BlockNumber from fixed width integers to varint ones. This isn't meant as a serious patch, but an experiment to see if this is a path worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off (for increased reproducibility) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining FPIs, 118.1 to 111.0 including FPIs.

There's plenty other spots that could be converted (e.g. the record length which rarely needs four bytes), this is just meant as a demonstration.

I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A profile does show pg_varint_decode_uint64(), but partially that seems to be offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains. That would be tolerable, I think, because waldump --stats pretty much doesn't do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint code. Right now it e.g. stores data as big endian, and the bswap instructions do show up. And a lot of my bit-maskery could be optimized too.

Greetings,

Andres Freund
Attachment
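For readers unfamiliar with variable-width integer encodings, here is a simple LEB128-style sketch. This is not the encoding from the attached patch (which, as noted elsewhere in the thread, is big-endian and built around fixed 8-byte reads); it only illustrates why small values, such as typical tablespace/database OIDs, relfilenumbers and block numbers, shrink to a byte or two:

#include "postgres.h"

/* Encode 'value' into 'buf' using 7 data bits per byte; returns length. */
static size_t
varint_encode_uint64(uint8 *buf, uint64 value)
{
	size_t		len = 0;

	do
	{
		uint8		b = value & 0x7F;

		value >>= 7;
		if (value != 0)
			b |= 0x80;			/* continuation bit: more bytes follow */
		buf[len++] = b;
	} while (value != 0);

	return len;					/* 1..10 bytes for a uint64 */
}

/* Decode a value encoded as above; returns the number of bytes consumed. */
static size_t
varint_decode_uint64(const uint8 *buf, uint64 *value)
{
	uint64		result = 0;
	size_t		len = 0;
	int			shift = 0;

	for (;;)
	{
		uint8		b = buf[len++];

		result |= (uint64) (b & 0x7F) << shift;
		if ((b & 0x80) == 0)
			break;
		shift += 7;
	}
	*value = result;
	return len;
}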
On Wed, Oct 5, 2022 at 5:19 AM Andres Freund <andres@anarazel.de> wrote:
>
> I light dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducability) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB
>
> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.
>

I have also executed my original test after applying these patches on top of the 56 bit relfilenode patch. So earlier we saw the WAL size increased by 11% (66199.09375 kB to 73906.984375 kB), and with this patch the WAL generated is 58179.2265625 kB. That means in this particular example this patch is reducing the WAL size by 12% even with the 56 bit relfilenode patch.

[1] https://www.postgresql.org/message-id/CAFiTN-uut%2B04AdwvBY_oK_jLvMkwXUpDJj5mXg--nek%2BucApPQ%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have also executed my original test after applying these patches on > top of the 56 bit relfilenode patch. So earlier we saw the WAL size > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this > patch now the WAL generated is 58179.2265625. That means in this > particular example this patch is reducing the WAL size by 12% even > with the 56 bit relfilenode patch. That's a very promising result, but the question in my mind is how much work would be required to bring this patch to a committable state? -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-10-10 08:10:22 -0400, Robert Haas wrote:
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch. So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625. That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
>
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

The biggest part clearly is to review the variable width integer patch. It's not a large amount of code, but probably more complicated than average.

One complication there is that currently the patch assumes:

 * Note that this function, for efficiency, reads 8 bytes, even if the
 * variable integer is less than 8 bytes long. The buffer has to be
 * allocated sufficiently large to account for that fact. The maximum
 * amount of memory read is 9 bytes.

We could make a less efficient version without that assumption, but I think it might be easier to just guarantee it in the xlog*.c case.

Using it in xloginsert.c is pretty darn simple, code-wise. xlogreader is a bit harder, although not for intrinsic reasons - the logic underlying COPY_HEADER_FIELD seems unnecessarily complicated to me. The minimal solution would likely be to just wrap the varint reads in another weird macro.

Leaving the code issues themselves aside, one important thing would be to evaluate what the performance impacts of the varint encoding/decoding are as part of a "full" server. I suspect it'll vanish in the noise, but we'd need to validate that.

Greetings,

Andres Freund
On Mon, Oct 10, 2022 at 5:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have also executed my original test after applying these patches on
> > top of the 56 bit relfilenode patch. So earlier we saw the WAL size
> > increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
> > patch now the WAL generated is 58179.2265625. That means in this
> > particular example this patch is reducing the WAL size by 12% even
> > with the 56 bit relfilenode patch.
>
> That's a very promising result, but the question in my mind is how
> much work would be required to bring this patch to a committable
> state?

Right, the results are promising. I have done some more testing, comparing the WAL size of a make installcheck run (fpw=off), and I see a similar gain with this patch.

1. Head: 272 MB
2. 56 bit RelFileLocator: 285 MB
3. 56 bit RelFileLocator + this patch: 261 MB

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> > On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I thought about trying to buy back some space elsewhere, and I think
> > > that would be a reasonable approach to getting this committed if we
> > > could find a way to do it. However, I don't see a terribly obvious way
> > > of making it happen.
> >
> > I think there's plenty potential...
>
> I light dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducability) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB

I think a significant part of this improvement comes from the premise of starting with a fresh database. tablespace OID will indeed most likely be low, but database OID may very well be linearly distributed if concurrent workloads in the cluster include updating (potentially unlogged) TOASTed columns and the databases are not created in one "big bang" but over the lifetime of the cluster. In that case DBOID will consume 5B for a significant fraction of databases (anything with OID >=2^28).

My point being: I don't think that we should have different WAL performance in databases which is dependent on which OID was assigned to that database.

In addition, this varlen encoding of relfilenode would mean that performance would drop over time, as a relation's relfilelocator is updated to something with a wider number (through VACUUM FULL or other relfilelocator cycling; e.g. re-importing a database). For maximum performance, you'd have to tune your database to have the lowest possible database, namespace and relfilelocator numbers, which (in older clusters) implies hacking into the catalogs - which seems like an antipattern.

I would have much less issue with this if we had separate counters per database (and approximately incremental dbOids), but that's not the case right now.

> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.

That's quite promising.

> There's plenty other spots that could be converted (e.g. the record length
> which rarely needs four bytes), this is just meant as a demonstration.

Agreed.

> I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
> profile does show pg_varint_decode_uint64(), but partially that seems to be
> offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.
>
> That would be tolerable, I think, because waldump --stats pretty much doesn't
> do anything with the WAL.
>
> But I suspect there's plenty of optimization potential in the varint
> code. Right now it e.g. stores data as big endian, and the bswap instructions
> do show up. And a lot of my bit-maskery could be optimized too.

One thing that comes to mind is that we will never see dbOid < 2^8 (and rarely < 2^14, nor spcOid less than 2^8 for that matter), so we'll probably waste at least one or two bits in the encoding of those values. That's not the end of the world, but it'd probably be better if we could improve on that - up to 6% of the field's disk usage would be wasted on an always-on bit.
---- Attached is a prototype patchset that reduces the WAL record size in many common cases. This is a prototype, as it fails tests due to a locking issue in prepared_xacts that I have not been able to find the source of yet. It also could use some more polishing, but the base case seems quite good. I haven't yet run the numbers though...

0001 - Extract xl_rminfo from xl_info
See [0] for more info as to why that's useful; the patch was pulled from there. It is mainly here to reduce the size of 0002, and mostly consists of find-and-replace changes in the rmgrs, extracting their bits from xl_info.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size of many xlog records by several bytes. This implements the design I shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader
This patch could be applied on current HEAD, and saves some bytes in per-block data. It potentially saves some bytes per registered block/buffer in the WAL record (max 2 bytes for the first block, after that up to 3). See the commit message in the patch for detailed information.

Kind regards, Matthias van de Meent [0] https://postgr.es/m/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg%40mail.gmail.com [1] https://postgr.es/m/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com
Hi, On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote: > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote: > > On 2022-10-03 10:01:25 -0700, Andres Freund wrote: > > > On 2022-10-03 08:12:39 -0400, Robert Haas wrote: > > > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote: > > > > I thought about trying to buy back some space elsewhere, and I think > > > > that would be a reasonable approach to getting this committed if we > > > > could find a way to do it. However, I don't see a terribly obvious way > > > > of making it happen. > > > > > > I think there's plenty potential... > > > > I light dusted off my old varint implementation from [1] and converted the > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This > > isn't meant as a serious patch, but an experiment to see if this is a path > > worth pursuing. > > > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off > > (for increased reproducability) shows a decent saving: > > > > master: 241106544 - 230 MB > > varint: 227858640 - 217 MB > > I think a signficant part of this improvement comes from the premise > of starting with a fresh database. tablespace OID will indeed most > likely be low, but database OID may very well be linearly distributed > if concurrent workloads in the cluster include updating (potentially > unlogged) TOASTed columns and the databases are not created in one > "big bang" but over the lifetime of the cluster. In that case DBOID > will consume 5B for a significant fraction of databases (anything with > OID >=2^28). > > My point being: I don't think that we should have different WAL > performance in databases which is dependent on which OID was assigned > to that database. To me this is raising the bar to an absurd level. Some minor space usage increase after oid wraparound and for very large block numbers isn't a huge issue - if you're in that situation you already have a huge amount of wal. > 0002 - Rework XLogRecord > This makes many fields in the xlog header optional, reducing the size > of many xlog records by several bytes. This implements the design I > shared in my earlier message [1]. > > 0003 - Rework XLogRecordBlockHeader. > This patch could be applied on current head, and saves some bytes in > per-block data. It potentially saves some bytes per registered > block/buffer in the WAL record (max 2 bytes for the first block, after > that up to 3). See the patch's commit message in the patch for > detailed information. The amount of complexity these two introduce seems quite substantial to me. Both from a maintenance and a runtime perspective. I think we'd be better off using building blocks like variable-length encoded values rather than open-coding it in many places. Greetings, Andres Freund
On Wed, 12 Oct 2022 at 23:13, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote: > > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote: > > > I light dusted off my old varint implementation from [1] and converted the > > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This > > > isn't meant as a serious patch, but an experiment to see if this is a path > > > worth pursuing. > > > > > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off > > > (for increased reproducability) shows a decent saving: > > > > > > master: 241106544 - 230 MB > > > varint: 227858640 - 217 MB > > > > I think a signficant part of this improvement comes from the premise > > of starting with a fresh database. tablespace OID will indeed most > > likely be low, but database OID may very well be linearly distributed > > if concurrent workloads in the cluster include updating (potentially > > unlogged) TOASTed columns and the databases are not created in one > > "big bang" but over the lifetime of the cluster. In that case DBOID > > will consume 5B for a significant fraction of databases (anything with > > OID >=2^28). > > > > My point being: I don't think that we should have different WAL > > performance in databases which is dependent on which OID was assigned > > to that database. > > To me this is raising the bar to an absurd level. Some minor space usage > increase after oid wraparound and for very large block numbers isn't a huge > issue - if you're in that situation you already have a huge amount of wal. I didn't want to block all varlen encoding; I just want to make clear that I don't think it's great for performance testing and consistency across installations if WAL size (and thus part of your performance) depends on which actual database/relation/tablespace combination you're running your workload in. With the 56-bit relfilenode, the size of a block reference would realistically differ between 7 bytes and 23 bytes:
- tblspc=0 = 1B, db=16386 = 3B, rel=797 = 2B (797 = 4 * default # of data relations in a fresh DB in PG14, + 1), block=0 = 1B
vs
- tblspc>=2^28 = 5B, db>=2^28 = 5B, rel>=2^49 = 8B, block>=2^28 = 5B
That's a difference of 16 bytes, of which only the block number can realistically be directly influenced by the user ("just don't have relations larger than X blocks"). If applied to Dilip's pgbench transaction data, that would imply a minimum per-transaction WAL usage of 509 bytes, and a maximum per-transaction WAL usage of 609 bytes. That is nearly a 20% difference in WAL size based only on the location of your data, and I'm just not comfortable with that. Users have little or zero control over the internal IDs we assign to these objects, yet those IDs would affect performance fairly significantly. (The percentage difference between min and max WAL size is unchanged (within 0.1%) after accounting for record alignment.) > > 0002 - Rework XLogRecord > > This makes many fields in the xlog header optional, reducing the size > > of many xlog records by several bytes. This implements the design I > > shared in my earlier message [1]. > > > > 0003 - Rework XLogRecordBlockHeader. > > This patch could be applied on current head, and saves some bytes in > > per-block data. It potentially saves some bytes per registered > > block/buffer in the WAL record (max 2 bytes for the first block, after > > that up to 3). See the patch's commit message in the patch for > > detailed information.
> > The amount of complexity these two introduce seems quite substantial to > me. Both from an maintenance and a runtime perspective. I think we'd be better > off using building blocks like variable lengths encoded values than open > coding it in many places. I guess that's true for length fields, but I don't think dynamic header field presence (the 0002 rewrite, and the omission of data_length in 0003) is that bad. We already have dynamic data inclusion through block IDs 25x; I'm not sure why we couldn't do that more compactly with bitfields as indicators instead (hence the dynamic header size). As for complexity, I think my current patchset is mostly complex due to a lack of tooling. Note that decoding makes common use of COPY_HEADER_FIELD, which we don't really have an equivalent for in XLogRecordAssemble. I think the code for 0002 would improve significantly in readability if such a construct were available. To reduce complexity in 0003, I could drop the 'repeat id' optimization; that simplifies things significantly, at the cost of not saving that 1 byte per registered block after the first. Kind regards, Matthias van de Meent
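[As a rough sketch of the "presence bits instead of always-present header fields" idea being debated here - the flag names and field set below are hypothetical and do not match what the 0002/0003 patches actually do - the assembly side could look something like this.]

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical layout: a leading flags byte announces which optional
 * header fields follow.  Not the layout used by the 0002/0003 patches.
 */
#define SKETCH_HAS_XID      0x01    /* a 4-byte xid follows */
#define SKETCH_HAS_RMINFO   0x02    /* a 1-byte rmgr-specific info byte follows */

static size_t
sketch_emit_header(uint8_t *out, bool has_xid, uint32_t xid,
                   bool has_rminfo, uint8_t rminfo)
{
    uint8_t     flags = 0;
    size_t      off = 1;        /* byte 0 is reserved for the flags */

    if (has_xid)
    {
        flags |= SKETCH_HAS_XID;
        memcpy(out + off, &xid, sizeof(xid));   /* unaligned copy */
        off += sizeof(xid);
    }
    if (has_rminfo)
    {
        flags |= SKETCH_HAS_RMINFO;
        out[off++] = rminfo;
    }
    out[0] = flags;
    return off;                 /* 1, 2, 5 or 6 bytes instead of a fixed size */
}

[The decode side would test the same flag bits before consuming each field; an assembly-side counterpart to COPY_HEADER_FIELD, as suggested above, would mostly replace the repeated memcpy-and-advance bookkeeping shown here.]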
On Wed, Oct 12, 2022 at 5:13 PM Andres Freund <andres@anarazel.de> wrote: > > I think a signficant part of this improvement comes from the premise > > of starting with a fresh database. tablespace OID will indeed most > > likely be low, but database OID may very well be linearly distributed > > if concurrent workloads in the cluster include updating (potentially > > unlogged) TOASTed columns and the databases are not created in one > > "big bang" but over the lifetime of the cluster. In that case DBOID > > will consume 5B for a significant fraction of databases (anything with > > OID >=2^28). > > > > My point being: I don't think that we should have different WAL > > performance in databases which is dependent on which OID was assigned > > to that database. > > To me this is raising the bar to an absurd level. Some minor space usage > increase after oid wraparound and for very large block numbers isn't a huge > issue - if you're in that situation you already have a huge amount of wal. I have to admit that I worried about the same thing that Matthias raises, more or less. But I don't know whether I'm right to be worried. A variable-length representation of any kind is essentially a gamble that values requiring fewer bytes will be more common than values requiring more bytes, and by enough to justify the overhead that the method has. And, you want it to be more common for each individual user, not just overall. For example, more people are going to have small relations than large ones, but nobody wants performance to drop off a cliff when the relation passes a certain size threshold. Now, it wouldn't drop off a cliff here, but what about someone with a really big, append-only relation? Won't they just end up writing more to WAL than with the present system? Maybe not. They might still have some writes to relations other than the very large, append-only relation, and then they could still win. Also, if we assume that the overhead of the variable-length representation is never more than 1 byte beyond what is needed to represent the underlying quantity in the minimal number of bytes, they are only going to lose if their relation is already more than half the maximum theoretical size, and if that is the case, they are in danger of hitting the size limit anyway. You can argue that there's still a risk here, but it doesn't seem like that bad of a risk. But the same thing is not so obvious for, let's say, database OIDs. What if you just have one or a few databases, but due to the previous history of the cluster, their OIDs just happen to be big? Then you're just behind where you would have been without the patch. Granted, if this happens to you, you will be in the minority, because most users are likely to have small database OIDs, but the fact that other people are writing less WAL on average isn't going to make you happy about writing more WAL on average. And even for a user for which that doesn't happen, it's not at all unlikely that the gains they see will be less than what we see on a freshly-initdb'd database. So I don't really know what the answer is here. I don't think this technique sucks, but I don't think it's necessarily a categorical win for every case, either. And it even seems hard to reason about which cases are likely to be wins and which cases are likely to be losses. > > 0002 - Rework XLogRecord > > This makes many fields in the xlog header optional, reducing the size > > of many xlog records by several bytes. This implements the design I > > shared in my earlier message [1]. 
> > > > 0003 - Rework XLogRecordBlockHeader. > > This patch could be applied on current head, and saves some bytes in > > per-block data. It potentially saves some bytes per registered > > block/buffer in the WAL record (max 2 bytes for the first block, after > > that up to 3). See the patch's commit message in the patch for > > detailed information. > > The amount of complexity these two introduce seems quite substantial to > me. Both from a maintenance and a runtime perspective. I think we'd be better > off using building blocks like variable lengths encoded values than open > coding it in many places. I agree that this looks pretty ornate as written, but I think there might be some good ideas in here, too. It is also easy to reason about this kind of thing at least in terms of space consumption. It's a bit harder to know how things will play out in terms of CPU cycles and code complexity. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-10-17 17:14:21 -0400, Robert Haas wrote: > I have to admit that I worried about the same thing that Matthias > raises, more or less. But I don't know whether I'm right to be > worried. A variable-length representation of any kind is essentially a > gamble that values requiring fewer bytes will be more common than > values requiring more bytes, and by enough to justify the overhead > that the method has. And, you want it to be more common for each > individual user, not just overall. For example, more people are going > to have small relations than large ones, but nobody wants performance > to drop off a cliff when the relation passes a certain size threshold. > Now, it wouldn't drop off a cliff here, but what about someone with a > really big, append-only relation? Won't they just end up writing more > to WAL than with the present system? Perhaps. But I suspect it'd be a very small increase because they'd be using bulk-insert paths in all likelihood anyway, if they managed to get to a very large relation. And even in that case, if we e.g. were to make the record size variable length, they'd still pretty much never reach that and it'd be an overall win. The number of people with relations that large - leaving aside partitioning, which would still benefit since each partition is smaller - strikes me as a very small percentage. And as you say, it's not like there's a cliff where everything starts to be horrible. > Maybe not. They might still have some writes to relations other than > the very large, append-only relation, and then they could still win. > Also, if we assume that the overhead of the variable-length > representation is never more than 1 byte beyond what is needed to > represent the underlying quantity in the minimal number of bytes, they > are only going to lose if their relation is already more than half the > maximum theoretical size, and if that is the case, they are in danger > of hitting the size limit anyway. You can argue that there's still a > risk here, but it doesn't seem like that bad of a risk. Another thing here is that I suspect we ought to increase our relation size limit beyond 4-byte block numbers * blocksize at some point - and then we'll have to use variable encodings... Admittedly the amount of work needed to get there is substantial. Somewhat relatedly, I think we, very slowly, should move towards wider OIDs as well. Not having to deal with oid wraparound will be a significant win (particularly for toast), but to keep the overhead reasonable, we're going to need variable encodings. > But the same thing is not so obvious for, let's say, database OIDs. > What if you just have one or a few databases, but due to the previous > history of the cluster, their OIDs just happen to be big? Then you're > just behind where you would have been without the patch. Granted, if > this happens to you, you will be in the minority, because most users > are likely to have small database OIDs, but the fact that other people > are writing less WAL on average isn't going to make you happy about > writing more WAL on average. And even for a user for which that > doesn't happen, it's not at all unlikely that the gains they see will > be less than what we see on a freshly-initdb'd database. I agree that going for variable-width encodings on the basis of the database oid field alone would be an unconvincing proposition. But variably encoding database oids when we already variably encode other fields seems like a decent bet. If you e.g.
think of the 56-bit relfilenode field itself - obviously what I was thinking about in the first place - it's going to be a win much more often. To really lose, you'd have to have not just a large database oid, but also a large tablespace and relation oid and a huge block number... > So I don't really know what the answer is here. I don't think this > technique sucks, but I don't think it's necessarily a categorical win > for every case, either. And it even seems hard to reason about which > cases are likely to be wins and which cases are likely to be losses. True. I'm far less concerned than you or Matthias about increasing the size in rare cases as long as it wins in the majority of cases. But that doesn't mean every case is easy to consider. > > > 0002 - Rework XLogRecord > > > This makes many fields in the xlog header optional, reducing the size > > > of many xlog records by several bytes. This implements the design I > > > shared in my earlier message [1]. > > > > > > 0003 - Rework XLogRecordBlockHeader. > > > This patch could be applied on current head, and saves some bytes in > > > per-block data. It potentially saves some bytes per registered > > > block/buffer in the WAL record (max 2 bytes for the first block, after > > > that up to 3). See the patch's commit message in the patch for > > > detailed information. > > > > The amount of complexity these two introduce seems quite substantial to > > me. Both from a maintenance and a runtime perspective. I think we'd be better > > off using building blocks like variable lengths encoded values than open > > coding it in many places. > > I agree that this looks pretty ornate as written, but I think there > might be some good ideas in here, too. Agreed! Several of the ideas seem orthogonal to using variable encodings, so this isn't really an either/or. > It is also easy to reason about this kind of thing at least in terms of > space consumption. Hm, not for me, but... Greetings, Andres Freund
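[To put rough numbers on that, here is a back-of-the-envelope tally under the same 7-bits-per-byte assumption as the earlier sketch - again an illustration, not the actual patch encoding. It reproduces the 7-byte vs. 23-byte range Matthias quoted, and shows that hitting the worst case requires all four components to be near their maxima at the same time.]

#include <stdint.h>

/*
 * Illustration only: byte counts under a generic 7-bits-per-byte varint
 * scheme, not the encoding from the actual patch.
 */
static int
varint_size_uint64(uint64_t value)
{
    int         len = 1;

    while (value >= 0x80)
    {
        value >>= 7;
        len++;
    }
    return len;
}

static int
block_ref_size(uint64_t spcOid, uint64_t dbOid,
               uint64_t relNumber, uint64_t blockNum)
{
    return varint_size_uint64(spcOid) + varint_size_uint64(dbOid) +
        varint_size_uint64(relNumber) + varint_size_uint64(blockNum);
}

/*
 * block_ref_size(0, 16386, 797, 0)                 ->  7 bytes
 * block_ref_size(1U << 28, 1U << 28,
 *                UINT64_C(1) << 49, 1U << 28)      -> 23 bytes
 */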
On Thu, Oct 20, 2022 at 12:51 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-10-17 17:14:21 -0400, Robert Haas wrote: > > I have to admit that I worried about the same thing that Matthias > > raises, more or less. But I don't know whether I'm right to be > > worried. A variable-length representation of any kind is essentially a > > gamble that values requiring fewer bytes will be more common than > > values requiring more bytes, and by enough to justify the overhead > > that the method has. And, you want it to be more common for each > > individual user, not just overall. For example, more people are going > > to have small relations than large ones, but nobody wants performance > > to drop off a cliff when the relation passes a certain size threshold. > > Now, it wouldn't drop off a cliff here, but what about someone with a > > really big, append-only relation? Won't they just end up writing more > > to WAL than with the present system? > > Perhaps. But I suspect it'd be a very small increase because they'd be using > bulk-insert paths in all likelihood anyway, if they managed to get to a very > large relation. And even in that case, if we e.g. were to make the record size > variable length, they'd still pretty much never reach that and it'd be an > overall win. I think the number of cases where we will reduce the WAL size will be far more than the cases where it will slightly increase, and the number of bytes we save in the winning cases is far bigger than the number of bytes we add in the losing ones. So IMHO it seems like an overall win, at least from the WAL size reduction POV. Do we have any numbers for how much overhead the encoding/decoding adds? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com