Thread: Bug in recovery of drop database?

Bug in recovery of drop database?

From
David Steele
Date:
It appears that if recovery ends on a drop database command the database 
is only partially removed -- the files are removed but the entry in 
pg_database remains, leading to this error:

psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: 
FATAL:  cannot connect to invalid database "mytest"
HINT:  Use DROP DATABASE to drop invalid databases.

Here's pg_database:

$ test/pg/bin/psql postgres -c 'select oid, datname from pg_database 
order by oid';
   oid  |  datname
-------+-----------
      1 | template1
      4 | template0
      5 | postgres
  16384 | mytest
(4 rows)

And the base data dir:

$ ls -lah test/data/base
total 20K
drwx------  5 dev dialout 4.0K Feb  5 19:31 .
drwx------ 19 dev dialout 4.0K Feb  5 19:31 ..
drwx------  2 dev dialout 4.0K Feb  5 19:30 1
drwx------  2 dev dialout 4.0K Feb  5 19:30 4
drwx------  2 dev dialout 4.0K Feb  5 19:32 5

Here is a full repro:

test/pg/bin/pg_basebackup -c fast -X none -D test/backup/full -F plain
test/pg/bin/createdb mytest
test/pg/bin/psql mytest -t -c 'select now()'
  2025-02-05 19:18:26.245167+00

test/pg/bin/dropdb mytest
test/pg/bin/psql mytest
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: 
FATAL:  database "mytest" does not exist

test/pg/bin/pg_ctl -D test/data -w stop
rm -rf test/data
cp -rp test/backup/full test/data
touch test/data/recovery.signal
echo "restore_command = 'cp /home/dev/test/archive/%f %p'" >> 
test/data/postgresql.auto.conf
echo "recovery_target_time = '2025-02-05 19:18:26.245167+00'" >> 
test/data/postgresql.auto.conf
test/pg/bin/pg_ctl -D test/data -w start

test/pg/bin/psql mytest -c 'select count(*) from pgbench_accounts'
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: 
FATAL:  cannot connect to invalid database "mytest"
HINT:  Use DROP DATABASE to drop invalid databases.

However, if I insert this command before the drop database:

test/pg/bin/psql mytest -c 'create table test (id int)'

Then recovery works fine since it is able to stop before the the drop 
database. This is an issue on HEAD as well as all current back branches.

For completeness here is my configuration on this test system (though I 
have seen this issue in a variety of environments):

wal_level = replica
archive_mode = on
archive_command = 'cp %p /home/dev/test/archive/%f'
log_min_messages = debug1
max_wal_senders = 10
log_line_prefix = '%m %L '
autovacuum = off

We have seen this several times now because a favorite test for users is 
to drop a database and see if they can recover it with PITR. PITR in a 
test database is always a bit dicey since little/no WAL activity means 
recovery doesn't have much to go on RE recovery targets, but at the 
least the database should be fully there or fully not there.

Regards,
-David



Re: Bug in recovery of drop database?

From
David Steele
Date:
On 2/5/25 15:24, David Steele wrote:
> 
> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: 
> FATAL:  cannot connect to invalid database "mytest"
> HINT:  Use DROP DATABASE to drop invalid databases.

Since this error message started cropping up fairly recently I had a 
look at related commits and found this:

c66a7d75 Handle DROP DATABASE getting interrupted

This commit changed the error message so it is more obvious that 
something has gone wrong, but the underlying issue appears the same for 
recovery.

Before c66a7d75 it looked like this:

FATAL:  database "mytest" does not exist
DETAIL:  The database subdirectory "base/16384" is missing.

Which seems reasonable when just looking at the error and not the detail.

Regards,
-David



Re: Bug in recovery of drop database?

From
David Steele
Date:
On 2/6/25 13:05, Andres Freund wrote:
> On 2025-02-06 15:55:21 +0000, David Steele wrote:
>>
>> Before c66a7d75 it looked like this:
>>
>> FATAL:  database "mytest" does not exist
>> DETAIL:  The database subdirectory "base/16384" is missing.
>>
>> Which seems reasonable when just looking at the error and not the detail.
> 
> Yea, I don't think my commit fundamentally changed things, it just made the
> issue a bit more apparent.  The biggest change is that it actually allows to
> "fix" the leaked space from SQL.
> 
> To actually fix this problem we'd need to move the dropping of database files
> to happen as part of transaction commit, similar to how relation files aren't
> actually truncated the moment you truncate a table.  That's not a trivial
> change...

Understatement.

> WRT your repro:
>> test/pg/bin/psql mytest -t -c 'select now()'
>>   2025-02-05 19:18:26.245167+00
>>
>> test/pg/bin/dropdb mytest
>> test/pg/bin/psql mytest
>> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
>> FATAL:  database "mytest" does not exist
>> ...
>> echo "recovery_target_time = '2025-02-05 19:18:26.245167+00'" >>
>> test/data/postgresql.auto.conf
>> ...
>> test/pg/bin/psql mytest -c 'select count(*) from pgbench_accounts'
>> psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
>> FATAL:  cannot connect to invalid database "mytest"
>> HINT:  Use DROP DATABASE to drop invalid databases.
> 
> I'd guess this is due to using the default recovery_target_inclusive=true.

I got the same result with recovery_target_inclusive=false. I think you 
could get it to work with a more precise timestamp or better, xid or lsn.

> The reason:
> 
>> However, if I insert this command before the drop database:
>>
>> test/pg/bin/psql mytest -c 'create table test (id int)'
>>
>> Then recovery works fine since it is able to stop before the the drop
>> database. This is an issue on HEAD as well as all current back branches.
> 
> works, presumably is that now there's another transaction to stop at
> *after*. Without the additional transaction the first complete transaction
> with the recovery_target_time

Given I'm grabbing the timestamp before the create table I'd expect it 
to stop before the table is created whether recovery_target_inclusive is 
true or false. I just gave it a try and sure enough either way I get:

$ test/pg/bin/psql mytest -c 'select count(*) from test'
ERROR:  relation "test" does not exist

> IOW, while the two-step-drop-database is an issue, even without it, your
> reproducer doesn't seem like it actually would work, even if it were a single
> step, to implement the below scenario:

The repro works almost every time for me on all six active branches. It 
hasn't worked a few times, but in those cases I suspect I accidentally 
skipped a step. I'm running it manually each time.

>> We have seen this several times now because a favorite test for users is to
>> drop a database and see if they can recover it with PITR. PITR in a test
>> database is always a bit dicey since little/no WAL activity means recovery
>> doesn't have much to go on RE recovery targets, but at the least the
>> database should be fully there or fully not there.
> 
> Because the repro actually sets the recovery target to *after* the drop
> database.
> 
> Am I misunderstanding something?

Without looking at the code, my guess is the dropping of files WAL 
record does not have a transaction timestamp that recovery can use so 
recovery blows by the dropping of files and stops before the entry is 
removed from pg_database.

It sounds like this is more or less a known issue. If we decide not to 
fix it how about I at least document it as a caveat for drop database?

Regards,
-David



Re: Bug in recovery of drop database?

From
David Steele
Date:
On 2/6/25 14:28, David Steele wrote:
> On 2/6/25 13:05, Andres Freund wrote:
> 
>> IOW, while the two-step-drop-database is an issue, even without it, your
>> reproducer doesn't seem like it actually would work, even if it were a 
>> single
>> step, to implement the below scenario:
> 
> The repro works almost every time for me on all six active branches. It 
> hasn't worked a few times, but in those cases I suspect I accidentally 
> skipped a step. I'm running it manually each time.

Sorry, I misunderstood you here -- I thought you were saying the 
reproducer doesn't work as is, which was confusing since it definitely does.

But I think it would work just fine if recovery was able stop before the 
drop database. The timestamp is gathered before the drop database so I'd 
expect that timestamp to get me recovered to before the drop database 
whether it is inclusive or not. Certainly that's what the users expect 
and that's what happens if I add another transaction before the drop 
database.

Time-targeted recovery, which is by far the most common, doesn't always 
work as one would hope depending on what WAL records are available to 
guide recovery. The test scenarios that users setup are usually very 
misleading because the lack of organic database activity often leads to 
unexpected results.

I think in a database with normal activity this condition would be 
fairly hard to hit, but that doesn't make it any less a bug. And the 
fact that it is so easy to hit in common test scenarios makes users 
worry about whether Postgres is able to recover consistently. I've seen 
two reports of this recently which likely means even more people are 
running into it.

Regards,
-David