Thread: Bug in recovery of drop database?
It appears that if recovery ends on a drop database command the database is only partially removed -- the files are removed but the entry in pg_database remains, leading to this error: psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: cannot connect to invalid database "mytest" HINT: Use DROP DATABASE to drop invalid databases. Here's pg_database: $ test/pg/bin/psql postgres -c 'select oid, datname from pg_database order by oid'; oid | datname -------+----------- 1 | template1 4 | template0 5 | postgres 16384 | mytest (4 rows) And the base data dir: $ ls -lah test/data/base total 20K drwx------ 5 dev dialout 4.0K Feb 5 19:31 . drwx------ 19 dev dialout 4.0K Feb 5 19:31 .. drwx------ 2 dev dialout 4.0K Feb 5 19:30 1 drwx------ 2 dev dialout 4.0K Feb 5 19:30 4 drwx------ 2 dev dialout 4.0K Feb 5 19:32 5 Here is a full repro: test/pg/bin/pg_basebackup -c fast -X none -D test/backup/full -F plain test/pg/bin/createdb mytest test/pg/bin/psql mytest -t -c 'select now()' 2025-02-05 19:18:26.245167+00 test/pg/bin/dropdb mytest test/pg/bin/psql mytest psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "mytest" does not exist test/pg/bin/pg_ctl -D test/data -w stop rm -rf test/data cp -rp test/backup/full test/data touch test/data/recovery.signal echo "restore_command = 'cp /home/dev/test/archive/%f %p'" >> test/data/postgresql.auto.conf echo "recovery_target_time = '2025-02-05 19:18:26.245167+00'" >> test/data/postgresql.auto.conf test/pg/bin/pg_ctl -D test/data -w start test/pg/bin/psql mytest -c 'select count(*) from pgbench_accounts' psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: cannot connect to invalid database "mytest" HINT: Use DROP DATABASE to drop invalid databases. However, if I insert this command before the drop database: test/pg/bin/psql mytest -c 'create table test (id int)' Then recovery works fine since it is able to stop before the the drop database. This is an issue on HEAD as well as all current back branches. For completeness here is my configuration on this test system (though I have seen this issue in a variety of environments): wal_level = replica archive_mode = on archive_command = 'cp %p /home/dev/test/archive/%f' log_min_messages = debug1 max_wal_senders = 10 log_line_prefix = '%m %L ' autovacuum = off We have seen this several times now because a favorite test for users is to drop a database and see if they can recover it with PITR. PITR in a test database is always a bit dicey since little/no WAL activity means recovery doesn't have much to go on RE recovery targets, but at the least the database should be fully there or fully not there. Regards, -David
On 2/5/25 15:24, David Steele wrote: > > psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: > FATAL: cannot connect to invalid database "mytest" > HINT: Use DROP DATABASE to drop invalid databases. Since this error message started cropping up fairly recently I had a look at related commits and found this: c66a7d75 Handle DROP DATABASE getting interrupted This commit changed the error message so it is more obvious that something has gone wrong, but the underlying issue appears the same for recovery. Before c66a7d75 it looked like this: FATAL: database "mytest" does not exist DETAIL: The database subdirectory "base/16384" is missing. Which seems reasonable when just looking at the error and not the detail. Regards, -David
On 2/6/25 13:05, Andres Freund wrote: > On 2025-02-06 15:55:21 +0000, David Steele wrote: >> >> Before c66a7d75 it looked like this: >> >> FATAL: database "mytest" does not exist >> DETAIL: The database subdirectory "base/16384" is missing. >> >> Which seems reasonable when just looking at the error and not the detail. > > Yea, I don't think my commit fundamentally changed things, it just made the > issue a bit more apparent. The biggest change is that it actually allows to > "fix" the leaked space from SQL. > > To actually fix this problem we'd need to move the dropping of database files > to happen as part of transaction commit, similar to how relation files aren't > actually truncated the moment you truncate a table. That's not a trivial > change... Understatement. > WRT your repro: >> test/pg/bin/psql mytest -t -c 'select now()' >> 2025-02-05 19:18:26.245167+00 >> >> test/pg/bin/dropdb mytest >> test/pg/bin/psql mytest >> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: >> FATAL: database "mytest" does not exist >> ... >> echo "recovery_target_time = '2025-02-05 19:18:26.245167+00'" >> >> test/data/postgresql.auto.conf >> ... >> test/pg/bin/psql mytest -c 'select count(*) from pgbench_accounts' >> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: >> FATAL: cannot connect to invalid database "mytest" >> HINT: Use DROP DATABASE to drop invalid databases. > > I'd guess this is due to using the default recovery_target_inclusive=true. I got the same result with recovery_target_inclusive=false. I think you could get it to work with a more precise timestamp or better, xid or lsn. > The reason: > >> However, if I insert this command before the drop database: >> >> test/pg/bin/psql mytest -c 'create table test (id int)' >> >> Then recovery works fine since it is able to stop before the the drop >> database. This is an issue on HEAD as well as all current back branches. > > works, presumably is that now there's another transaction to stop at > *after*. Without the additional transaction the first complete transaction > with the recovery_target_time Given I'm grabbing the timestamp before the create table I'd expect it to stop before the table is created whether recovery_target_inclusive is true or false. I just gave it a try and sure enough either way I get: $ test/pg/bin/psql mytest -c 'select count(*) from test' ERROR: relation "test" does not exist > IOW, while the two-step-drop-database is an issue, even without it, your > reproducer doesn't seem like it actually would work, even if it were a single > step, to implement the below scenario: The repro works almost every time for me on all six active branches. It hasn't worked a few times, but in those cases I suspect I accidentally skipped a step. I'm running it manually each time. >> We have seen this several times now because a favorite test for users is to >> drop a database and see if they can recover it with PITR. PITR in a test >> database is always a bit dicey since little/no WAL activity means recovery >> doesn't have much to go on RE recovery targets, but at the least the >> database should be fully there or fully not there. > > Because the repro actually sets the recovery target to *after* the drop > database. > > Am I misunderstanding something? Without looking at the code, my guess is the dropping of files WAL record does not have a transaction timestamp that recovery can use so recovery blows by the dropping of files and stops before the entry is removed from pg_database. It sounds like this is more or less a known issue. If we decide not to fix it how about I at least document it as a caveat for drop database? Regards, -David
On 2/6/25 14:28, David Steele wrote: > On 2/6/25 13:05, Andres Freund wrote: > >> IOW, while the two-step-drop-database is an issue, even without it, your >> reproducer doesn't seem like it actually would work, even if it were a >> single >> step, to implement the below scenario: > > The repro works almost every time for me on all six active branches. It > hasn't worked a few times, but in those cases I suspect I accidentally > skipped a step. I'm running it manually each time. Sorry, I misunderstood you here -- I thought you were saying the reproducer doesn't work as is, which was confusing since it definitely does. But I think it would work just fine if recovery was able stop before the drop database. The timestamp is gathered before the drop database so I'd expect that timestamp to get me recovered to before the drop database whether it is inclusive or not. Certainly that's what the users expect and that's what happens if I add another transaction before the drop database. Time-targeted recovery, which is by far the most common, doesn't always work as one would hope depending on what WAL records are available to guide recovery. The test scenarios that users setup are usually very misleading because the lack of organic database activity often leads to unexpected results. I think in a database with normal activity this condition would be fairly hard to hit, but that doesn't make it any less a bug. And the fact that it is so easy to hit in common test scenarios makes users worry about whether Postgres is able to recover consistently. I've seen two reports of this recently which likely means even more people are running into it. Regards, -David