Re: BUG #17226: hot standby mode, create database and drop database cause standby restrat failed - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: BUG #17226: hot standby mode, create database and drop database cause standby restrat failed
Date
Msg-id CAD21AoDXGFKmRNPj-OuqTu28rEKMhWfJm0-p52gts=UKwjp-zw@mail.gmail.com
Whole thread Raw
In response to BUG #17226: hot standby mode, create database and drop database cause standby restrat failed  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
On Wed, Oct 13, 2021 at 6:49 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17226
> Logged by:          Bo Chen
> Email address:      bchen90@163.com
> PostgreSQL version: 11.8
> Operating system:   Suse linux 11.3 x86_64
> Description:
>
> Hi, all
>
> Recently, I got a FATAL while restarts standby, which can be reproduced by
> the following steps, based on pg 11:
> 1. create a database test
> 2. create another database test1 using test as template
> 3. drop database test
> 4. stop standby using 'pg_ctl stop -mi'
> 5. start standby, start failed
> 6. stop primary using 'pg_ctl stop -mi'
> 7. start primary, start OK
>
> The starting log for standby failed is:
>
> 2021-10-13 15:57:43.294 CST startup 2420219 1/0 0 %LOG:  00000: redo starts
> at 0/6000024 xlog.c (7043 StartupXLOG)
> 2021-10-13 15:57:43.297 CST startup 2420219 1/0 0 %FATAL:  58P01: could not
> open directory "base/16385": No such file or directory fd.c (2714
> ReadDirExtended)
> 2021-10-13 15:57:43.297 CST startup 2420219 1/0 0 %CONTEXT:  WAL redo at
> 0/7001200 for Database/CREATE: copy dir 16385/1663 to 16386/1663
> 2021-10-13 15:57:43.298 CST postmaster 2420180 0 %LOG:  00000: startup
> process (PID 2420219) exited with exit code 1 postmaster.c (3772
> LogChildExit)
>
>
> I think the reason for the above error is as follows:
> For standby, the redo of XLOG_DBASE_DROP and XLOG_DBASE_CREATE just remove
> and copy the databse directory but not peform checkpoint like create or drop
> database in primary. So, when we restart,  it redo from the log of create
> database but now the directory of the source database has been deleted, so
> redo failed.

Thank you for the report!

I could reproduce this issue. If WAL records are produced in the
following order and the standby restarts the crash recovery from #2
after replaying all records, the replay of #3 fails due to missing the
database directory (base/16384) used as a template:

1. CREATEDB copy 1/1663 to 16384/1663
2. CHECKPOINT (and becomes the restart point for the standby)
3. CREATEDB copy 16384/1663 to 16385/1663
4. DROPDB 16384/1663

This issue is related to the discussion on pgsql-hackers[1] and the
patch is already proposed[2]. I’ve applied v12-0003 patch and
confirmed that it fixes this issue as well (although not looked at how
to fix this issue). So probably this issue can be fixed together with
that issue.

Regards,

[1] https://www.postgresql.org/message-id/CAEET0ZGx9AvioViLf7nbR_8tH9-%3D27DN5xWJ2P9-ROH16e4JUA%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/SA1PR05MB85821DEC48A6841277BDBF57C3F29%40SA1PR05MB8582.namprd05.prod.outlook.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: Inconsistent behavior of pg_dump/pg_restore on DEFAULT PRIVILEGES
Next
From: Masahiko Sawada
Date:
Subject: Re: Inconsistent behavior of pg_dump/pg_restore on DEFAULT PRIVILEGES