Thread: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
Hi, While doing some additional testing of (incremental) backups, I ran into a couple regular failures. After pulling my hair for a couple days, I realized the issue seems to affect regular backups, and incremental backups (which I've been trying to test) are likely innocent. I'm using a simple (and admittedly not very pretty) bash scripts that takes and verified backups, concurrently with this workload: 1) initialize a cluster 2) initialize pgbench in database 'db' 3) run short pgbench on 'db' 4) maybe do vacuum [full] on 'db' 5) drop a database 'db_copy' if it exists 6) create a database 'db_copy' by copying 'db' using one of the available strategies (file_copy, wal_log) 7) run short pgbench on 'db_copy' 8) maybe do vacuum [full] on 'db_copy' And concurrently with this, it takes a basebackup, starts a cluster on it (on a different port, ofc), and does various checks on that: a) verify checksums using pg_checksums (cluster has them enabled) b) run amcheck on tables/indexes on both databases c) SQL check (we expect all tables to be 'consistent' as if we did a PITR - in particular sum(balance) is expected to be the same value on all pgbench tables) on both databases I believe those are reasonable expectations - that we get a database with valid checksums, with non-broken tables/indexes, and that the database looks as a snapshot taken at a single instant. Unfortunately it doesn't take long for the tests to start failing with various strange symptoms on the db_copy database (I'm yet to see an issue on the 'db' database): i) amcheck fails with 'heap tuple lacks matching index tuple' ERROR: heap tuple (116195,22) from table "pgbench_accounts" lacks matching index tuple within index "pgbench_accounts_pkey" HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error. I've seen this with other tables/indexes too, e.g. system catalogs pg_statitics or toast tables, but 'accounts' is most common. ii) amcheck fails with 'could not open file' ERROR: could not open file "base/18121/18137": No such file or directory LINE 9: lateral verify_heapam(relation => c.oid, on_error_stop => f... ^ ERROR: could not open file "base/18121/18137": No such file or directory iii) failures in the SQL check, with different tables have different balance sums SQL check fails (db_copy) (account 156142 branches 136132 tellers 136132 history -42826) Sometimes this is preceded by amcheck issue, but not always. I guess this is not the behavior we expect :-( I've reproduced all of this on PG16 - I haven't tried with older releases, but I have no reason to assume pre-16 releases are not affected. With incremental backups I've observed a couple more symptoms, but those are most likely just fallout of this - not realizing the initial state is a bit wrong, and making it worse by applying the increments. The important observation is that this only happens if a database is created while the backup is running, and that it only happens with the FILE_COPY strategy - I've never seen this with WAL_LOG (which is the default since PG15). I don't recall any reports of similar issues from pre-15 releases, where FILE_COPY was the only available option - I'm not sure why is that. Either it didn't have this issue back then, or maybe people happen to not create databases concurrently with a backup very often. It's a race condition / timing issue, essentially. I have no ambition to investigate this part of the code much deeper, or invent a fix myself, at least not in foreseeable future. But it seems like something we probably should fix - subtly broken backups are not a great thing. I see there have been a couple threads proposing various improvements to FILE_COPY, that might make it more efficient/faster, namely using the filesystem cloning [1] or switching pg_upgrade to use it [2]. But having something that's (maybe) faster but not quite correct does not seem like a winning strategy to me ... Alternatively, if we don't have clear desire to fix it, maybe the right solution would be get rid of it? regards [1] https://www.postgresql.org/message-id/CA+hUKGLM+t+SwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w@mail.gmail.com [2] https://www.postgresql.org/message-id/Zl9ta3FtgdjizkJ5%40nathan -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
From
Nathan Bossart
Date:
On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote: > The important observation is that this only happens if a database is > created while the backup is running, and that it only happens with the > FILE_COPY strategy - I've never seen this with WAL_LOG (which is the > default since PG15). My first thought is that this sounds related to the large comment in CreateDatabaseUsingFileCopy(): /* * We force a checkpoint before committing. This effectively means that * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be * replayed (at least not in ordinary crash recovery; we still have to * make the XLOG entry for the benefit of PITR operations). This avoids * two nasty scenarios: * * #1: When PITR is off, we don't XLOG the contents of newly created * indexes; therefore the drop-and-recreate-whole-directory behavior of * DBASE_CREATE replay would lose such indexes. * * #2: Since we have to recopy the source database during DBASE_CREATE * replay, we run the risk of copying changes in it that were committed * after the original CREATE DATABASE command but before the system crash * that led to the replay. This is at least unexpected and at worst could * lead to inconsistencies, eg duplicate table names. * * (Both of these were real bugs in releases 8.0 through 8.0.3.) * * In PITR replay, the first of these isn't an issue, and the second is * only a risk if the CREATE DATABASE and subsequent template database * change both occur while a base backup is being taken. There doesn't * seem to be much we can do about that except document it as a * limitation. * * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE * strategy that avoids these problems. */ > I don't recall any reports of similar issues from pre-15 releases, where > FILE_COPY was the only available option - I'm not sure why is that. > Either it didn't have this issue back then, or maybe people happen to > not create databases concurrently with a backup very often. It's a race > condition / timing issue, essentially. If it requires concurrent activity on the template database, I wouldn't be surprised at all that this is rare. > I see there have been a couple threads proposing various improvements to > FILE_COPY, that might make it more efficient/faster, namely using the > filesystem cloning [1] or switching pg_upgrade to use it [2]. But having > something that's (maybe) faster but not quite correct does not seem like > a winning strategy to me ... > > Alternatively, if we don't have clear desire to fix it, maybe the right > solution would be get rid of it? It would be unfortunate if we couldn't use this for pg_upgrade, especially if it is unaffected by these problems. -- nathan
Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
From
Tomas Vondra
Date:
On 6/24/24 17:14, Nathan Bossart wrote: > On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote: >> The important observation is that this only happens if a database is >> created while the backup is running, and that it only happens with the >> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the >> default since PG15). > > My first thought is that this sounds related to the large comment in > CreateDatabaseUsingFileCopy(): > > /* > * We force a checkpoint before committing. This effectively means that > * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be > * replayed (at least not in ordinary crash recovery; we still have to > * make the XLOG entry for the benefit of PITR operations). This avoids > * two nasty scenarios: > * > * #1: When PITR is off, we don't XLOG the contents of newly created > * indexes; therefore the drop-and-recreate-whole-directory behavior of > * DBASE_CREATE replay would lose such indexes. > * > * #2: Since we have to recopy the source database during DBASE_CREATE > * replay, we run the risk of copying changes in it that were committed > * after the original CREATE DATABASE command but before the system crash > * that led to the replay. This is at least unexpected and at worst could > * lead to inconsistencies, eg duplicate table names. > * > * (Both of these were real bugs in releases 8.0 through 8.0.3.) > * > * In PITR replay, the first of these isn't an issue, and the second is > * only a risk if the CREATE DATABASE and subsequent template database > * change both occur while a base backup is being taken. There doesn't > * seem to be much we can do about that except document it as a > * limitation. > * > * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE > * strategy that avoids these problems. > */ > Perhaps, the mentioned risks certainly seem like it might be related to the issues I'm observing. >> I don't recall any reports of similar issues from pre-15 releases, where >> FILE_COPY was the only available option - I'm not sure why is that. >> Either it didn't have this issue back then, or maybe people happen to >> not create databases concurrently with a backup very often. It's a race >> condition / timing issue, essentially. > > If it requires concurrent activity on the template database, I wouldn't be > surprised at all that this is rare. > Right. Although, "concurrent" here means a somewhat different thing. AFAIK there can't be a any changes concurrent with the CREATE DATABASE directly, because we make sure there are no connections: createdb: error: database creation failed: ERROR: source database "test" is being accessed by other users DETAIL: There is 1 other session using the database. But per the comment, it'd be a problem if there is activity after the database gets copied, but before the backup completes (which is where the replay will happen). >> I see there have been a couple threads proposing various improvements to >> FILE_COPY, that might make it more efficient/faster, namely using the >> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having >> something that's (maybe) faster but not quite correct does not seem like >> a winning strategy to me ... >> >> Alternatively, if we don't have clear desire to fix it, maybe the right >> solution would be get rid of it? > > It would be unfortunate if we couldn't use this for pg_upgrade, especially > if it is unaffected by these problems. > Yeah. I wouldn't mind using FILE_COPY in contexts where we know it's safe, like pg_upgrade. I just don't want to let users to unknowingly step on this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company