Re: storing an explicit nonce - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: storing an explicit nonce |
Date | |
Msg-id | CA+TgmoapMh3FQHYoEjKWv81ZCFNaOZ5EEgsci0Bn_yTA=OHRLw@mail.gmail.com Whole thread Raw |
In response to | Re: storing an explicit nonce (Bruce Momjian <bruce@momjian.us>) |
List | pgsql-hackers |
On Thu, May 27, 2021 at 12:15 PM Bruce Momjian <bruce@momjian.us> wrote: > > Well, in the design where the nonce is stored in the page, there is no > > need for every hint-type change to appear in the WAL at all. Once per > > checkpoint cycle, you need to write a full page image, as we do for > > checksums or wal_log_hints. The rest of the time, you can just bump > > the nonce and rewrite the page, same as we do today. > > What is it about having the nonce be the LSN that doesn't allow that to > happen? Could we just create a dummy LSN record and assign that to the > page and use that as a nonce. I can't tell which of two possible proposals you are describing here. If the LSN is used to derive the nonce, then one option is to just log a WAL record every time we need a new nonce. As I understand it, that's basically what you've already implemented, and we've discussed the disadvantages of that approach at some length already. The basic problems seem to be: - It's potentially very expensive if page evictions are frequent, which they will be whenever the workload is write-heavy and the working set is larger than shared_buffers. - If there's ever a situation where we need to write a page image different from any page image written previously and we cannot at that time write a WAL record to generate a new LSN for use as the nonce, then the algorithm is broken entirely. Andres's latest post points out - I think correctly - that this happens on standbys, because WAL replay does not generate byte-identical results on standbys even if you ignore hint bits. The first point strikes me as a sufficiently serious performance problem to justify giving up on this design, but that's a judgement call. The second one seems like it breaks it entirely. Now, there's another possible direction that is also suggested by your remarks here: maybe you meant using a fake LSN in cases where we can't use a real one. For example, suppose you decide to reserve half of the LSN space - all LSNs with the high bit set, for example - for this purpose. Well, you somehow need to ensure that you never use one of those values more than once, so you might think of putting a counter in shared memory. But now imagine a master with two standbys. How would you avoid having the same counter value used on one standby and also on the other standby? Even if they use the same counter for different pages, it's a critical security flaw. And since those standbys don't even need to know that the other one exists, that seems pretty well impossible to avoid. Now you might ask why we don't have the same problem if we store the nonce in the special space. One difference is that if you store the nonce explicitly, you can allow however much bit space you need in order to guarantee uniqueness, whereas reserving half the LSN space only gives you 63 bits. That's not enough to achieve uniqueness without tight coordination. With 128 bits, you can do things like just generate random values and assume they're vanishingly unlikely to collide, or randomly generate half the value and use the other half as a counter and be pretty safe. With 63 bits you just don't have enough bit space available to reliably avoid collisions using algorithms of that type, due to the birthday paradox. I think it would be adequate for uniqueness if there were a single shared counter and every allocation came from it, but again, as soon as you imagine a master and a bunch of standbys, that breaks down. Also, it's not entirely clear to me that you can avoid needing the LSN space on the page for a real LSN at the same time you also need it for a fake-LSN-being-used-as-a-nonce. We rely on the LSN field containing the LSN of the last WAL record for the page in order to obey the WAL-before-data rule, without which crash recovery will not work reliably. Now, if you sometimes have to use that field for a nonce that is a fake LSN, that means you no longer always have a place to store the real LSN. I can't convince myself off-hand that it's completely impossible to work around that problem, but it seems like any attempt to do so would be complicated and fragile at best. I don't think that's a direction that we want to go. Making crash recovery work reliably is a hard problem where we've had lots of bugs despite years of dedicated effort. TDE is also complex and has lots of pitfalls of its own. If we take two things which are individually complicated and hard to get right and intertwine them by making them share bit-space, I think it drives the complexity up to a level where we don't have much hope of getting things right. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: