Re: storing an explicit nonce - Mailing list pgsql-hackers

From Robert Haas
Subject Re: storing an explicit nonce
Date
Msg-id CA+TgmoapMh3FQHYoEjKWv81ZCFNaOZ5EEgsci0Bn_yTA=OHRLw@mail.gmail.com
In response to Re: storing an explicit nonce  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, in the design where the nonce is stored in the page, there is no
> > need for every hint-type change to appear in the WAL at all. Once per
> > checkpoint cycle, you need to write a full page image, as we do for
> > checksums or wal_log_hints. The rest of the time, you can just bump
> > the nonce and rewrite the page, same as we do today.
>
> What is it about having the nonce be the LSN that doesn't allow that to
> happen?  Could we just create a dummy LSN record and assign that to the
> page and use that as a nonce?

I can't tell which of two possible proposals you are describing here.
If the LSN is used to derive the nonce, then one option is to just log
a WAL record every time we need a new nonce. As I understand it,
that's basically what you've already implemented, and we've discussed
the disadvantages of that approach at some length already. The basic
problems seem to be:

- It's potentially very expensive if page evictions are frequent,
which they will be whenever the workload is write-heavy and the
working set is larger than shared_buffers.
- If there's ever a situation where we need to write a page image
different from any page image written previously and we cannot at that
time write a WAL record to generate a new LSN for use as the nonce,
then the algorithm is broken entirely. Andres's latest post points out
- I think correctly - that this happens on standbys, because WAL
replay does not generate byte-identical results on standbys even if
you ignore hint bits.

The first point strikes me as a sufficiently serious performance
problem to justify giving up on this design, but that's a judgement
call. The second one seems like it breaks it entirely.

Now, there's another possible direction that is also suggested by your
remarks here: maybe you meant using a fake LSN in cases where we can't
use a real one. For example, suppose you decide to reserve half of the
LSN space - all LSNs with the high bit set, for example - for this
purpose. Well, you somehow need to ensure that you never use one of
those values more than once, so you might think of putting a counter
in shared memory. But now imagine a master with two standbys. How
would you avoid having the same counter value used on one standby and
also on the other standby? Even if they use the same counter for
different pages, it's a critical security flaw. And since those
standbys don't even need to know that the other one exists, that seems
pretty well impossible to avoid.
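
To make the two-standby problem concrete, here is a toy sketch, not PostgreSQL code. The names FAKE_LSN_FLAG and Standby are hypothetical, and it assumes each standby keeps its own local counter that starts from the same replicated state, with no coordination between nodes.

```python
FAKE_LSN_FLAG = 1 << 63  # hypothetical: high bit marks a "fake" LSN


class Standby:
    """Toy standby with its own local fake-LSN counter (an assumption)."""

    def __init__(self, name, start_counter):
        self.name = name
        self.counter = start_counter

    def next_fake_lsn(self):
        # Each standby bumps only its local counter; it cannot see its peers.
        self.counter += 1
        return FAKE_LSN_FLAG | self.counter


# Both standbys are initialized from the same primary state.
standby_a = Standby("a", start_counter=1000)
standby_b = Standby("b", start_counter=1000)

nonces_a = {standby_a.next_fake_lsn() for _ in range(5)}
nonces_b = {standby_b.next_fake_lsn() for _ in range(5)}

# Values issued on both standbys, i.e. nonces that were reused.
reused = nonces_a & nonces_b
print(len(reused))
```

In this sketch every fake LSN handed out on one standby is also handed out on the other, and neither node has any way to notice.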

Now you might ask why we don't have the same problem if we store the
nonce in the special space. One difference is that if you store the
nonce explicitly, you can allocate however much bit space you need in
order to guarantee uniqueness, whereas reserving half the LSN space
only gives you 63 bits. That's not enough to achieve uniqueness
without tight coordination. With 128 bits, you can do things like just
generate random values and assume they're vanishingly unlikely to
collide, or randomly generate half the value and use the other half as
a counter and be pretty safe. With 63 bits you just don't have enough
bit space available to reliably avoid collisions using algorithms of
that type, due to the birthday paradox. I think it would be adequate
for uniqueness if there were a single shared counter and every
allocation came from it, but again, as soon as you imagine a master
and a bunch of standbys, that breaks down.
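
As a rough illustration of the birthday-paradox point, the sketch below uses the standard approximation p ~= 1 - exp(-n^2 / 2^(b+1)) for n uniformly random b-bit values. The figure of 2^32 nonces is an arbitrary assumption chosen for the example, not a measured workload, and HybridNonce is a hypothetical name for the "half random, half counter" idea.

```python
import math
import secrets


def collision_probability(num_nonces, bits):
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(bits+1))."""
    exponent = -(num_nonces ** 2) / float(2 ** (bits + 1))
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


n = 2 ** 32  # hypothetical number of random nonces drawn across all nodes

p63 = collision_probability(n, 63)
p128 = collision_probability(n, 128)
print(f"63 bits:  {p63:.3f}")
print(f"128 bits: {p128:.3e}")


class HybridNonce:
    """128-bit nonce: 64 random bits chosen once, plus a 64-bit counter."""

    def __init__(self):
        self.prefix = secrets.randbits(64)  # differs per node with high probability
        self.counter = 0

    def next(self):
        self.counter += 1
        return (self.prefix << 64) | (self.counter & (2 ** 64 - 1))
```

Under these assumptions the 63-bit case collides with probability around 0.63, while the 128-bit case is on the order of 1e-20.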

Also, it's not entirely clear to me that you can avoid needing the LSN
space on the page for a real LSN at the same time you also need it for
a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing
the LSN of the last WAL record for the page in order to obey the
WAL-before-data rule, without which crash recovery will not work
reliably. Now, if you sometimes have to use that field for a nonce
that is a fake LSN, that means you no longer always have a place to
store the real LSN. I can't convince myself off-hand that it's
completely impossible to work around that problem, but it seems like
any attempt to do so would be complicated and fragile at best. I don't
think that's a direction that we want to go. Making crash recovery
work reliably is a hard problem where we've had lots of bugs despite
years of dedicated effort. TDE is also complex and has lots of
pitfalls of its own. If we take two things which are individually
complicated and hard to get right and intertwine them by making them
share bit-space, I think it drives the complexity up to a level where
we don't have much hope of getting things right.
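
A toy model of the conflict, assuming the page write path flushes WAL up to the page's LSN before writing the page. ToyWal, write_page, and FAKE_LSN_FLAG are hypothetical names; this is a sketch of the rule, not how PostgreSQL implements it.

```python
FAKE_LSN_FLAG = 1 << 63  # hypothetical marker for a fake LSN used as a nonce


class ToyWal:
    """Toy WAL tracking how far records were inserted and flushed."""

    def __init__(self):
        self.insert_lsn = 0
        self.flushed_lsn = 0

    def insert(self, length):
        # Appending a record advances the end of WAL.
        self.insert_lsn += length
        return self.insert_lsn

    def flush(self, upto):
        # A flush request past the end of WAL cannot be satisfied.
        if upto > self.insert_lsn:
            raise ValueError("flush request beyond end of WAL")
        self.flushed_lsn = max(self.flushed_lsn, upto)


def write_page(wal, page_lsn):
    # WAL-before-data: WAL up to the page's LSN must be durable first.
    wal.flush(page_lsn)
    return True


wal = ToyWal()
real_lsn = wal.insert(100)
ok_real = write_page(wal, real_lsn)  # real LSN: flush succeeds, page is written

fake_lsn = FAKE_LSN_FLAG | 1  # nonce stored where the real LSN should be
try:
    write_page(wal, fake_lsn)
    ok_fake = True
except ValueError:
    ok_fake = False

print(ok_real, ok_fake)
```

Once the field holds a fake LSN, the write path in this model no longer knows which WAL position actually has to be durable, which is the piece any workaround would have to supply some other way.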

-- 
Robert Haas
EDB: http://www.enterprisedb.com