Thread: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
"Ryo Matsumura (Fujitsu)"
Date:
Hi, hackers.

I found that CREATE DATABASE occurs lost of DDL result after crash server.
The direct cause is that the checkpoint skips sync for page of template1's main fork
because buffer status is not marked as BM_PERMANENT in BufferAlloc().

Have you any knowledge about it?

Reproduction:
1) Do initdb.
2) Start server.
3) Connect to 'postgres' database.
4) Execute CREATE DATABASE statement with WAL_LOG strategy.
5) Connect to 'template1' database.
6) Update pg_class by executing some DDL.
7) Do checkpoint.
8) Crash server.

[src/backend/storage/buffer/bufmgr.c]
1437     if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
1438         buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
1439     else
             !!! walk this route !!!
1440         buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

The above is occured by the following call.
The argument 'permanent' of ReadBufferWithoutRelcache() is passed to
BufferAlloc() as 'relpersistence'.

[src/backend/commands/]
 298         buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
 299                                         RBM_NORMAL, bstrategy, false);

Regards
Ryo Matsumura



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Justin Pryzby
Date:
On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote:
> Hi, hackers.
> 
> I found that CREATE DATABASE occurs lost of DDL result after crash server.
> The direct cause is that the checkpoint skips sync for page of template1's main fork
> because buffer status is not marked as BM_PERMANENT in BufferAlloc().

I had some trouble reproducing this when running the commands by hand.

But it reproduces fine like this:

$ ./tmp_install/usr/local/pgsql/bin/postgres -D ./testrun/regress/regress/tmp_check/data& sleep 2; psql -h /tmp
postgres-c "DROP DATABASE IF EXISTS j" -c "CREATE DATABASE j STRATEGY wal_log" && psql -h /tmp template1 -c "CREATE
TABLEt(i int)" -c "INSERT INTO t SELECT generate_series(1,9)" -c CHECKPOINT; kill -9 %1; wait;
./tmp_install/usr/local/pgsql/bin/postgres-D ./testrun/regress/regress/tmp_check/data& sleep 9; psql -h /tmp template1
-c"table t"; kill %1
 
[1] 29069
2023-02-15 10:10:27.584 CST postmaster[29069] LOG:  starting PostgreSQL 16devel on x86_64-linux, compiled by gcc-9.4.0,
64-bit
2023-02-15 10:10:27.584 CST postmaster[29069] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-02-15 10:10:27.663 CST postmaster[29069] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-02-15 10:10:27.728 CST startup[29074] LOG:  database system was shut down at 2023-02-15 10:10:13 CST
2023-02-15 10:10:27.780 CST postmaster[29069] LOG:  database system is ready to accept connections
NOTICE:  database "j" does not exist, skipping
DROP DATABASE
CREATE DATABASE
CREATE TABLE
INSERT 0 9
2023-02-15 10:10:30.160 CST checkpointer[29072] LOG:  checkpoint starting: immediate force wait
2023-02-15 10:10:30.740 CST checkpointer[29072] LOG:  checkpoint complete: wrote 943 buffers (5.8%); 0 WAL file(s)
added,0 removed, 0 recycled; write=0.070 s, sync=0.369 s, total=0.581 s; sync files=268, longest=0.274 s, average=0.002
s;distance=4322 kB, estimate=4322 kB; lsn=0/BA9E8A0, redo lsn=0/BA9E868
 
CHECKPOINT
[1]+  Killed                  ./tmp_install/usr/local/pgsql/bin/postgres -D ./testrun/regress/regress/tmp_check/data
[1] 29088
2023-02-15 10:10:31.664 CST postmaster[29088] LOG:  starting PostgreSQL 16devel on x86_64-linux, compiled by gcc-9.4.0,
64-bit
2023-02-15 10:10:31.665 CST postmaster[29088] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-02-15 10:10:31.724 CST postmaster[29088] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-02-15 10:10:31.780 CST startup[29094] LOG:  database system was interrupted; last known up at 2023-02-15 10:10:30
CST
2023-02-15 10:10:33.888 CST startup[29094] LOG:  database system was not properly shut down; automatic recovery in
progress
2023-02-15 10:10:33.934 CST startup[29094] LOG:  redo starts at 0/BA9E868
2023-02-15 10:10:33.934 CST startup[29094] LOG:  invalid record length at 0/BA9E918: wanted 24, got 0
2023-02-15 10:10:33.934 CST startup[29094] LOG:  redo done at 0/BA9E8A0 system usage: CPU: user: 0.00 s, system: 0.00
s,elapsed: 0.00 s
 
2023-02-15 10:10:34.073 CST checkpointer[29092] LOG:  checkpoint starting: end-of-recovery immediate wait
2023-02-15 10:10:34.275 CST checkpointer[29092] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added,
0removed, 0 recycled; write=0.035 s, sync=0.026 s, total=0.257 s; sync files=2, longest=0.019 s, average=0.013 s;
distance=0kB, estimate=0 kB; lsn=0/BA9E918, redo lsn=0/BA9E918
 
2023-02-15 10:10:34.321 CST postmaster[29088] LOG:  database system is ready to accept connections
2023-02-15 10:10:39.893 CST client backend[29110] psql ERROR:  relation "t" does not exist at character 7
2023-02-15 10:10:39.893 CST client backend[29110] psql STATEMENT:  table t
ERROR:  relation "t" does not exist



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote:
> The above is occured by the following call.
> The argument 'permanent' of ReadBufferWithoutRelcache() is passed to
> BufferAlloc() as 'relpersistence'.
> 
> [src/backend/commands/]
>  298         buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
>  299                                         RBM_NORMAL, bstrategy, false);

Indeed, setting that to true (as per the attached patch) seems to fix this.
I don't see any reason this _shouldn't_ be true from what I've read so far.
We're reading pg_class, which will probably never be UNLOGGED.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Dilip Kumar
Date:
On Thu, Feb 16, 2023 at 5:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote:
> > The above is occured by the following call.
> > The argument 'permanent' of ReadBufferWithoutRelcache() is passed to
> > BufferAlloc() as 'relpersistence'.
> >
> > [src/backend/commands/]
> >  298         buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
> >  299                                         RBM_NORMAL, bstrategy, false);
>
> Indeed, setting that to true (as per the attached patch) seems to fix this.
> I don't see any reason this _shouldn't_ be true from what I've read so far.
> We're reading pg_class, which will probably never be UNLOGGED.

Yes, there is no reason to pass this as false, seems like this is
passed false by mistake.  And your patch fixes the issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Michael Paquier
Date:
On Thu, Feb 16, 2023 at 10:24:13AM +0530, Dilip Kumar wrote:
> Yes, there is no reason to pass this as false, seems like this is
> passed false by mistake.  And your patch fixes the issue.

So, if I am understanding this stuff right, this issue can create data
corruption once a DDL updates any pages of pg_class stored in a
template database that gets copied by this routine.  In this case, the
patch sent makes sure that any page copied will get written once a
checkpoint kicks in.  Shouldn't we have at least a regression test for
such a scenario?  The issue can happen when updating a template
database after creating a database from it, which is less worrying
than the initial impression I got, still I'd like to think that we
should have some coverage as of the special logic this code path
relies on for pg_class when reading its buffers.

I have not given much attention to this area, but I am a bit
suspicious that enforcing the default as WAL_LOG was a good idea for
15~, TBH.  We are usually much more conservative when it comes to
such choices, switching to the new behavior after a few years would
have been wiser..
--
Michael

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote:
> So, if I am understanding this stuff right, this issue can create data
> corruption once a DDL updates any pages of pg_class stored in a
> template database that gets copied by this routine.  In this case, the
> patch sent makes sure that any page copied will get written once a
> checkpoint kicks in.  Shouldn't we have at least a regression test for
> such a scenario?  The issue can happen when updating a template
> database after creating a database from it, which is less worrying
> than the initial impression I got, still I'd like to think that we
> should have some coverage as of the special logic this code path
> relies on for pg_class when reading its buffers.

I was able to quickly hack together a TAP test that seems to reliably fail
before the fix and pass afterwards.  There's probably a better place for
it, though...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Dilip Kumar
Date:
On Thu, Feb 16, 2023 at 12:11 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote:
> > So, if I am understanding this stuff right, this issue can create data
> > corruption once a DDL updates any pages of pg_class stored in a
> > template database that gets copied by this routine.  In this case, the
> > patch sent makes sure that any page copied will get written once a
> > checkpoint kicks in.  Shouldn't we have at least a regression test for
> > such a scenario?  The issue can happen when updating a template
> > database after creating a database from it, which is less worrying
> > than the initial impression I got, still I'd like to think that we
> > should have some coverage as of the special logic this code path
> > relies on for pg_class when reading its buffers.
>
> I was able to quickly hack together a TAP test that seems to reliably fail
> before the fix and pass afterwards.  There's probably a better place for
> it, though...

I think the below change is not relevant to this bug right?

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 209118a639..6e9f8a7c7f 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -39,6 +39,7 @@ tests += {
       't/031_recovery_conflict.pl',
       't/032_relfilenode_reuse.pl',
       't/033_replay_tsp_drops.pl',
+      't/100_bugs.pl',
     ],
   },
 }

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Robert Haas
Date:
On Thu, Feb 16, 2023 at 12:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think the below change is not relevant to this bug right?
>
> diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
> index 209118a639..6e9f8a7c7f 100644
> --- a/src/test/recovery/meson.build
> +++ b/src/test/recovery/meson.build
> @@ -39,6 +39,7 @@ tests += {
>        't/031_recovery_conflict.pl',
>        't/032_relfilenode_reuse.pl',
>        't/033_replay_tsp_drops.pl',
> +      't/100_bugs.pl',
>      ],
>    },
>  }

Why not? The patch creates 100_bugs.pl so it also adds it to meson.build.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Michael Paquier
Date:
On Thu, Feb 16, 2023 at 12:37:57PM +0530, Robert Haas wrote:
> Why not? The patch creates 100_bugs.pl so it also adds it to meson.build.

It would not matter for REL_15_STABLE, but on HEAD all the new TAP
test files have to be added in their respective meson.build.  If you
don't do that, the CFBot would not test it, neither would a local
build with meson.

Adding a test in src/test/recovery/ is OK by me.  This is a recovery
case, and that's a..  Bug.  Another possible name would be something
like 0XX_create_database.pl, now that's me being picky.
--
Michael

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Dilip Kumar
Date:
On Thu, Feb 16, 2023 at 12:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 12:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think the below change is not relevant to this bug right?
> >
> > diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
> > index 209118a639..6e9f8a7c7f 100644
> > --- a/src/test/recovery/meson.build
> > +++ b/src/test/recovery/meson.build
> > @@ -39,6 +39,7 @@ tests += {
> >        't/031_recovery_conflict.pl',
> >        't/032_relfilenode_reuse.pl',
> >        't/033_replay_tsp_drops.pl',
> > +      't/100_bugs.pl',
> >      ],
> >    },
> >  }
>
> Why not? The patch creates 100_bugs.pl so it also adds it to meson.build.

Yeah my bad, I somehow assumed this was an existing file.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Andres Freund
Date:
On 2023-02-16 12:37:57 +0530, Robert Haas wrote:
> The patch creates 100_bugs.pl

What's the story behind 100_bugs.pl? This name clearly is copied from
src/test/subscription/t/100_bugs.pl - but I've never understood why that is
outside of the normal numbering space.



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Amit Kapila
Date:
On Fri, Feb 17, 2023 at 2:59 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-02-16 12:37:57 +0530, Robert Haas wrote:
> > The patch creates 100_bugs.pl
>
> What's the story behind 100_bugs.pl? This name clearly is copied from
> src/test/subscription/t/100_bugs.pl - but I've never understood why that is
> outside of the normal numbering space.
>

Yeah, I have also previously wondered about this name for
src/test/subscription/t/100_bugs.pl. My guess is that it has been kept
to distinguish it from the other feature tests which have numbering
starting from 001.

-- 
With Regards,
Amit Kapila.



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Peter Eisentraut
Date:
On 16.02.23 22:29, Andres Freund wrote:
> What's the story behind 100_bugs.pl? This name clearly is copied from
> src/test/subscription/t/100_bugs.pl - but I've never understood why that is
> outside of the normal numbering space.

Mainly to avoid awkwardness for backpatching.  The number of tests in 
src/test/subscription/ varies quite a bit across branches.



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Fri, Feb 17, 2023 at 03:13:32PM +0100, Peter Eisentraut wrote:
> On 16.02.23 22:29, Andres Freund wrote:
>> What's the story behind 100_bugs.pl? This name clearly is copied from
>> src/test/subscription/t/100_bugs.pl - but I've never understood why that is
>> outside of the normal numbering space.
> 
> Mainly to avoid awkwardness for backpatching.  The number of tests in
> src/test/subscription/ varies quite a bit across branches.

I'm happy to move this new test to wherever folks think it should go.  I'll
look around to see if I can find a better place, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Michael Paquier
Date:
On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote:
> I'm happy to move this new test to wherever folks think it should go.  I'll
> look around to see if I can find a better place, too.

I think that src/test/recovery/ is the best fit, because this stresses
a code path for WAL replay on pg_class for the template db.  The name
is not specific enough, though, why not just using something like
0NN_create_database.pl?
--
Michael

Attachment

RE: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
"Ryo Matsumura (Fujitsu)"
Date:
> On Thu, Feb 16, 2023 at 10:24:13AM +0530, Dilip Kumar wrote:
> > Yes, there is no reason to pass this as false, seems like this is
> > passed false by mistake.  And your patch fixes the issue.

On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote:
> So, if I am understanding this stuff right, this issue can create data
> corruption once a DDL updates any pages of pg_class stored in a
> template database that gets copied by this routine.  In this case, the
> patch sent makes sure that any page copied will get written once a
> checkpoint kicks in.

Thank you for comment and patch.
I think that the patch for dbcommand.c is fixed.
So I apply to my environment.

> I have not given much attention to this area, but I am a bit
> suspicious that enforcing the default as WAL_LOG was a good idea for
> 15~, TBH.  We are usually much more conservative when it comes to
> such choices, switching to the new behavior after a few years would
> have been wiser..

I think so too. I was surprised that new strategy is default.

Regards
Ryo Matsumura



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Mon, Feb 20, 2023 at 05:02:01PM +0900, Michael Paquier wrote:
> On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote:
>> I'm happy to move this new test to wherever folks think it should go.  I'll
>> look around to see if I can find a better place, too.
> 
> I think that src/test/recovery/ is the best fit, because this stresses
> a code path for WAL replay on pg_class for the template db.  The name
> is not specific enough, though, why not just using something like
> 0NN_create_database.pl?

Okay.  I've renamed the test file as suggested in v3.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Michael Paquier
Date:
On Mon, Feb 20, 2023 at 04:43:22PM -0800, Nathan Bossart wrote:
> On Mon, Feb 20, 2023 at 05:02:01PM +0900, Michael Paquier wrote:
>> On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote:
>>> I'm happy to move this new test to wherever folks think it should go.  I'll
>>> look around to see if I can find a better place, too.
>>
>> I think that src/test/recovery/ is the best fit, because this stresses
>> a code path for WAL replay on pg_class for the template db.  The name
>> is not specific enough, though, why not just using something like
>> 0NN_create_database.pl?
>
> Okay.  I've renamed the test file as suggested in v3.

The test enforces a checkpoint after the table creation on the
template, so what about testing it also without a checkpoint, like the
extended version attached?  I have tweaked a few things in the test,
while on it.
--
Michael

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Tue, Feb 21, 2023 at 03:13:10PM +0900, Michael Paquier wrote:
> The test enforces a checkpoint after the table creation on the
> template, so what about testing it also without a checkpoint, like the
> extended version attached?  I have tweaked a few things in the test,
> while on it.

What is the purpose of testing it without the checkpoint?  Other than that
question, the patch looks reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Michael Paquier
Date:
On Tue, Feb 21, 2023 at 10:00:11AM -0800, Nathan Bossart wrote:
> What is the purpose of testing it without the checkpoint?

Perhaps none, I was wondering whether it would be worth testing that
with the flush phase, but perhaps that's just extra cycles wasted at
this point.

> Other than that
> question, the patch looks reasonable to me.

Okay, applied and backpatched with a minimal test set, then.  I have
kept the tweaks I did to the tests with extra comments.
--
Michael

Attachment

Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

From
Nathan Bossart
Date:
On Wed, Feb 22, 2023 at 12:30:20PM +0900, Michael Paquier wrote:
> Okay, applied and backpatched with a minimal test set, then.  I have
> kept the tweaks I did to the tests with extra comments.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com