Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints |
Date | |
Msg-id | CA+TgmoYst3+34ANsDUO__Ydd3tt7y7-ob+0rU-iurgjg2xQkig@mail.gmail.com Whole thread Raw |
In response to | Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints (Bruce Momjian <bruce@momjian.us>) |
Responses |
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
|
List | pgsql-hackers |
On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian <bruce@momjian.us> wrote: > Honestly, I never understood why the checkpoint during CREATE DATABASE > was as problem --- we checkpoint by default every five minutes anyway, > so why is an additional two a problem --- it just means the next > checkpoint will do less work. It is hard to see how avoiding > checkpoints to add WAL writes, fscyncs, and replication traffic could be > a win. Try running pgbench with the --progress option and enough concurrent jobs to keep a moderately large system busy and watching what happens to the tps each time a checkpoint occurs. It's extremely dramatic, or at least it was the last time I ran such tests. I think that performance will sometimes drop by a factor of five or more when the checkpoint hits, and take multiple minutes to recover. I think your statement that doing an extra checkpoint "just means the next checkpoint will do less work" is kind of misleading. That's certainly true in some situations. But when the same pages are being dirtied over and over again, an extra checkpoint often means that the system will do MUCH MORE work, because every checkpoint triggers a new set of full-page writes over the actively-updated portion of the database. I think that very few people run systems with heavy write workloads with checkpoint_timeout=5m, precisely because of this issue. Almost every system I see has had that raised to at least 10m and sometimes 30m or more. It can make a massive difference. > I see the patch justification outlined here: > > https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com > > TDE is mentioned as a value for this patch, but I don't see why it is > needed --- TDE can easily decrypt/encrypt the pages while they are > copied. That's true, but depending on what other design decisions we make, WAL-logging it might be a problem. Right now, when someone creates a new database, we log a single record that basically says "go copy the directory'". That's very different than what we normally do, which is to log changes to individual pages, or where required, small groups of pages (e.g. a single WAL record is written for an UPDATE even though it may touch two pages). The fact that in this case we only log a single WAL record for an operation that could touch an unbounded amount of data is why this needs special handling around checkpoints. It also introduces a certain amount of fragility into the system, because if for some reason the source directory on the standby doesn't exactly match the source directory on the primary, the new databases won't match either. Any errors that creep into the process can be propagated around to other places by a system like this. However, ordinarily that doesn't happen, which is why we've been able to use this system successfully for so many years. The other reason we've been able to use this successfully is that we're confident that we can perform exactly the same operation on the standby as we do on the primary knowing only the relevant directory names. If we say "copy this directory to there" we believe we'll be able to do that exactly the same way on the standby. Is that still true with TDE? Well, it depends. If the encryption can be performed knowing only the key and the identity of the block (database OID, tablespace OID, relfilenode, fork, block number) then it's true. But if the encryption needs to, for example, generate a random nonce for each block, then it's false. If you want the standby to be an exact copy of the master in a system where new blocks get random nonces, then you need to replicate the copy block-by-block, not as one gigantic operation, so that you can log the nonce you picked for each block. On the other hand, maybe you DON'T want the standby to be an exact copy of the master. If, for example, you imagine a system where the master and standby aren't even using the same key, then this is a lot less relevant. I can't predict whether PostgreSQL will get TDE in the future, and if it does, I can't predict what form it will take. Therefore any strong statement about whether this will benefit TDE or not seems to me to be pretty questionable - we don't know that it will be useful, and we don't know that it won't. But, like Dilip, I think the way we're WAL-logging CREATE DATABASE right now is a hack, and I *know* it can cause massive performance drops on busy systems. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: