On 23/06/2021 12:45, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Let's just add the lock there.
>
> +1, no doubt about that.

Committed that. Thanks for the report, Alexander!

>> ... What about the new kid on the block:
>> Persistent Memory? I found this article:
>> https://lwn.net/Articles/686150/. So at hardware level, Persistent
>> Memory only guarantees atomicity at cache line level (64 bytes). To
>> provide the traditional 512 byte sector atomicity, there's a feature in
>> Linux called BTT. Perhaps we should add a note to the docs that you
>> should enable that.
>
> Right, also called sector mode. I don't know enough about that to
> comment really, but... if my google-fu is serving me, you can't
> actually use interesting sector sizes like 8KB (you have to choose 512
> or 4096 bytes), so you'll have to pay for *two* synthetic atomic page
> schemes: BTT and our full page writes. That makes me wonder... if you
> need to leave full page writes on anyway, maybe it would be a better
> trade-off to do double writes of our special atomic files (relmapper
> files and control file) so that we could safely turn BTT off and avoid
> double-taxation for relation data. Just a thought. No pmem
> experience here, I could be way off.

Yeah, you wouldn't want to turn on BTT for anything other than the
pg_control file. That's the only place where we rely on sector
atomicity, I believe. For everything else, it just adds overhead. Not
sure how much overhead; maybe it doesn't matter in practice.

>> We haven't heard of broken control files from the field, so that doesn't
>> seem to be a problem in practice, at least not yet. Still, I would sleep
>> better if the control file had more redundancy. For example, have two
>> copies of it on disk. At startup, read both copies, and if they're both
>> valid, ignore the one with older timestamp. When updating it, write over
>> the older copy. That way, if you crash in the middle of updating it, the
>> old copy is still intact.
>
> +1, with a flush in between so that only one can be borked no matter
> how the storage works. It is interesting how few reports there are on
> the mailing list of control file CRC check failures though, if I'm
> searching for the right thing[1].
>
> [1] https://www.postgresql.org/search/?m=1&q=calculated+CRC+checksum+does+not+match+value+stored+in+file&l=&d=-1&s=r

If anyone wants to write a patch for that, I'd be happy to review it. And
if anyone has access to a system with pmem hardware, it would be
interesting to try to reproduce a torn sector and broken control file by
pulling the power plug.
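
To make the ordering concrete, here is a rough sketch of the two-copy
scheme discussed above. It is in Python rather than C, purely to show
the idea. Everything in it is an assumption for illustration: the record
layout, the helper names (encode, decode, read_control, write_control),
and the use of an update counter where the proposal above said
timestamp. It is not how pg_control is laid out or written today.

```python
import os
import struct
import zlib

# Hypothetical record layout: update counter, payload length, payload, CRC32.
HEADER = struct.Struct("<QI")
CRC = struct.Struct("<I")


def encode(counter, payload):
    """Serialize one copy of the control data with a trailing CRC."""
    body = HEADER.pack(counter, len(payload)) + payload
    return body + CRC.pack(zlib.crc32(body))


def decode(raw):
    """Return (counter, payload), or None if the copy is torn or corrupt."""
    if len(raw) < HEADER.size + CRC.size:
        return None
    counter, length = HEADER.unpack_from(raw)
    end = HEADER.size + length
    if len(raw) < end + CRC.size:
        return None
    (stored,) = CRC.unpack_from(raw, end)
    if zlib.crc32(raw[:end]) != stored:
        return None
    return counter, raw[HEADER.size:end]


def read_control(paths):
    """Read both copies; return (index, counter, payload) of the newest
    valid one, or None if neither copy is valid."""
    best = None
    for i, path in enumerate(paths):
        try:
            with open(path, "rb") as f:
                rec = decode(f.read())
        except FileNotFoundError:
            rec = None
        if rec is not None and (best is None or rec[0] > best[1]):
            best = (i, rec[0], rec[1])
    return best


def write_control(paths, payload):
    """Overwrite the older (or invalid) copy and flush it to disk.
    The newest valid copy is never touched, so a crash in the middle
    of this function can damage at most one of the two copies."""
    current = read_control(paths)
    if current is None:
        target, counter = 0, 1
    else:
        target, counter = 1 - current[0], current[1] + 1
    with open(paths[target], "wb") as f:
        f.write(encode(counter, payload))
        f.flush()
        os.fsync(f.fileno())
```

Because each call to write_control touches only one file and fsyncs it
before returning, two successive updates are always separated by a
flush, which is the property Thomas asked for. A real patch would also
have to fsync the containing directory when a copy is first created,
and decide what to do at startup when neither copy validates.
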
- Heikki