Thread: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
Hi, hackers. I found that CREATE DATABASE occurs lost of DDL result after crash server. The direct cause is that the checkpoint skips sync for page of template1's main fork because buffer status is not marked as BM_PERMANENT in BufferAlloc(). Have you any knowledge about it? Reproduction: 1) Do initdb. 2) Start server. 3) Connect to 'postgres' database. 4) Execute CREATE DATABASE statement with WAL_LOG strategy. 5) Connect to 'template1' database. 6) Update pg_class by executing some DDL. 7) Do checkpoint. 8) Crash server. [src/backend/storage/buffer/bufmgr.c] 1437 if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM) 1438 buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE; 1439 else !!! walk this route !!! 1440 buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; The above is occured by the following call. The argument 'permanent' of ReadBufferWithoutRelcache() is passed to BufferAlloc() as 'relpersistence'. [src/backend/commands/] 298 buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno, 299 RBM_NORMAL, bstrategy, false); Regards Ryo Matsumura
On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote: > Hi, hackers. > > I found that CREATE DATABASE occurs lost of DDL result after crash server. > The direct cause is that the checkpoint skips sync for page of template1's main fork > because buffer status is not marked as BM_PERMANENT in BufferAlloc(). I had some trouble reproducing this when running the commands by hand. But it reproduces fine like this: $ ./tmp_install/usr/local/pgsql/bin/postgres -D ./testrun/regress/regress/tmp_check/data& sleep 2; psql -h /tmp postgres-c "DROP DATABASE IF EXISTS j" -c "CREATE DATABASE j STRATEGY wal_log" && psql -h /tmp template1 -c "CREATE TABLEt(i int)" -c "INSERT INTO t SELECT generate_series(1,9)" -c CHECKPOINT; kill -9 %1; wait; ./tmp_install/usr/local/pgsql/bin/postgres-D ./testrun/regress/regress/tmp_check/data& sleep 9; psql -h /tmp template1 -c"table t"; kill %1 [1] 29069 2023-02-15 10:10:27.584 CST postmaster[29069] LOG: starting PostgreSQL 16devel on x86_64-linux, compiled by gcc-9.4.0, 64-bit 2023-02-15 10:10:27.584 CST postmaster[29069] LOG: listening on IPv4 address "127.0.0.1", port 5432 2023-02-15 10:10:27.663 CST postmaster[29069] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2023-02-15 10:10:27.728 CST startup[29074] LOG: database system was shut down at 2023-02-15 10:10:13 CST 2023-02-15 10:10:27.780 CST postmaster[29069] LOG: database system is ready to accept connections NOTICE: database "j" does not exist, skipping DROP DATABASE CREATE DATABASE CREATE TABLE INSERT 0 9 2023-02-15 10:10:30.160 CST checkpointer[29072] LOG: checkpoint starting: immediate force wait 2023-02-15 10:10:30.740 CST checkpointer[29072] LOG: checkpoint complete: wrote 943 buffers (5.8%); 0 WAL file(s) added,0 removed, 0 recycled; write=0.070 s, sync=0.369 s, total=0.581 s; sync files=268, longest=0.274 s, average=0.002 s;distance=4322 kB, estimate=4322 kB; lsn=0/BA9E8A0, redo lsn=0/BA9E868 CHECKPOINT [1]+ Killed ./tmp_install/usr/local/pgsql/bin/postgres -D ./testrun/regress/regress/tmp_check/data [1] 29088 2023-02-15 10:10:31.664 CST postmaster[29088] LOG: starting PostgreSQL 16devel on x86_64-linux, compiled by gcc-9.4.0, 64-bit 2023-02-15 10:10:31.665 CST postmaster[29088] LOG: listening on IPv4 address "127.0.0.1", port 5432 2023-02-15 10:10:31.724 CST postmaster[29088] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2023-02-15 10:10:31.780 CST startup[29094] LOG: database system was interrupted; last known up at 2023-02-15 10:10:30 CST 2023-02-15 10:10:33.888 CST startup[29094] LOG: database system was not properly shut down; automatic recovery in progress 2023-02-15 10:10:33.934 CST startup[29094] LOG: redo starts at 0/BA9E868 2023-02-15 10:10:33.934 CST startup[29094] LOG: invalid record length at 0/BA9E918: wanted 24, got 0 2023-02-15 10:10:33.934 CST startup[29094] LOG: redo done at 0/BA9E8A0 system usage: CPU: user: 0.00 s, system: 0.00 s,elapsed: 0.00 s 2023-02-15 10:10:34.073 CST checkpointer[29092] LOG: checkpoint starting: end-of-recovery immediate wait 2023-02-15 10:10:34.275 CST checkpointer[29092] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0removed, 0 recycled; write=0.035 s, sync=0.026 s, total=0.257 s; sync files=2, longest=0.019 s, average=0.013 s; distance=0kB, estimate=0 kB; lsn=0/BA9E918, redo lsn=0/BA9E918 2023-02-15 10:10:34.321 CST postmaster[29088] LOG: database system is ready to accept connections 2023-02-15 10:10:39.893 CST client backend[29110] psql ERROR: relation "t" does not exist at character 7 2023-02-15 10:10:39.893 CST client backend[29110] psql STATEMENT: table t ERROR: relation "t" does not exist
On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote: > The above is occured by the following call. > The argument 'permanent' of ReadBufferWithoutRelcache() is passed to > BufferAlloc() as 'relpersistence'. > > [src/backend/commands/] > 298 buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno, > 299 RBM_NORMAL, bstrategy, false); Indeed, setting that to true (as per the attached patch) seems to fix this. I don't see any reason this _shouldn't_ be true from what I've read so far. We're reading pg_class, which will probably never be UNLOGGED. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 16, 2023 at 5:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Feb 15, 2023 at 04:49:38AM +0000, Ryo Matsumura (Fujitsu) wrote: > > The above is occured by the following call. > > The argument 'permanent' of ReadBufferWithoutRelcache() is passed to > > BufferAlloc() as 'relpersistence'. > > > > [src/backend/commands/] > > 298 buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno, > > 299 RBM_NORMAL, bstrategy, false); > > Indeed, setting that to true (as per the attached patch) seems to fix this. > I don't see any reason this _shouldn't_ be true from what I've read so far. > We're reading pg_class, which will probably never be UNLOGGED. Yes, there is no reason to pass this as false, seems like this is passed false by mistake. And your patch fixes the issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 10:24:13AM +0530, Dilip Kumar wrote: > Yes, there is no reason to pass this as false, seems like this is > passed false by mistake. And your patch fixes the issue. So, if I am understanding this stuff right, this issue can create data corruption once a DDL updates any pages of pg_class stored in a template database that gets copied by this routine. In this case, the patch sent makes sure that any page copied will get written once a checkpoint kicks in. Shouldn't we have at least a regression test for such a scenario? The issue can happen when updating a template database after creating a database from it, which is less worrying than the initial impression I got, still I'd like to think that we should have some coverage as of the special logic this code path relies on for pg_class when reading its buffers. I have not given much attention to this area, but I am a bit suspicious that enforcing the default as WAL_LOG was a good idea for 15~, TBH. We are usually much more conservative when it comes to such choices, switching to the new behavior after a few years would have been wiser.. -- Michael
Attachment
On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote: > So, if I am understanding this stuff right, this issue can create data > corruption once a DDL updates any pages of pg_class stored in a > template database that gets copied by this routine. In this case, the > patch sent makes sure that any page copied will get written once a > checkpoint kicks in. Shouldn't we have at least a regression test for > such a scenario? The issue can happen when updating a template > database after creating a database from it, which is less worrying > than the initial impression I got, still I'd like to think that we > should have some coverage as of the special logic this code path > relies on for pg_class when reading its buffers. I was able to quickly hack together a TAP test that seems to reliably fail before the fix and pass afterwards. There's probably a better place for it, though... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 16, 2023 at 12:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote: > > So, if I am understanding this stuff right, this issue can create data > > corruption once a DDL updates any pages of pg_class stored in a > > template database that gets copied by this routine. In this case, the > > patch sent makes sure that any page copied will get written once a > > checkpoint kicks in. Shouldn't we have at least a regression test for > > such a scenario? The issue can happen when updating a template > > database after creating a database from it, which is less worrying > > than the initial impression I got, still I'd like to think that we > > should have some coverage as of the special logic this code path > > relies on for pg_class when reading its buffers. > > I was able to quickly hack together a TAP test that seems to reliably fail > before the fix and pass afterwards. There's probably a better place for > it, though... I think the below change is not relevant to this bug right? diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index 209118a639..6e9f8a7c7f 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -39,6 +39,7 @@ tests += { 't/031_recovery_conflict.pl', 't/032_relfilenode_reuse.pl', 't/033_replay_tsp_drops.pl', + 't/100_bugs.pl', ], }, } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 12:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think the below change is not relevant to this bug right? > > diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build > index 209118a639..6e9f8a7c7f 100644 > --- a/src/test/recovery/meson.build > +++ b/src/test/recovery/meson.build > @@ -39,6 +39,7 @@ tests += { > 't/031_recovery_conflict.pl', > 't/032_relfilenode_reuse.pl', > 't/033_replay_tsp_drops.pl', > + 't/100_bugs.pl', > ], > }, > } Why not? The patch creates 100_bugs.pl so it also adds it to meson.build. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 12:37:57PM +0530, Robert Haas wrote: > Why not? The patch creates 100_bugs.pl so it also adds it to meson.build. It would not matter for REL_15_STABLE, but on HEAD all the new TAP test files have to be added in their respective meson.build. If you don't do that, the CFBot would not test it, neither would a local build with meson. Adding a test in src/test/recovery/ is OK by me. This is a recovery case, and that's a.. Bug. Another possible name would be something like 0XX_create_database.pl, now that's me being picky. -- Michael
Attachment
On Thu, Feb 16, 2023 at 12:38 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Feb 16, 2023 at 12:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think the below change is not relevant to this bug right? > > > > diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build > > index 209118a639..6e9f8a7c7f 100644 > > --- a/src/test/recovery/meson.build > > +++ b/src/test/recovery/meson.build > > @@ -39,6 +39,7 @@ tests += { > > 't/031_recovery_conflict.pl', > > 't/032_relfilenode_reuse.pl', > > 't/033_replay_tsp_drops.pl', > > + 't/100_bugs.pl', > > ], > > }, > > } > > Why not? The patch creates 100_bugs.pl so it also adds it to meson.build. Yeah my bad, I somehow assumed this was an existing file. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > The patch creates 100_bugs.pl What's the story behind 100_bugs.pl? This name clearly is copied from src/test/subscription/t/100_bugs.pl - but I've never understood why that is outside of the normal numbering space.
On Fri, Feb 17, 2023 at 2:59 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > > The patch creates 100_bugs.pl > > What's the story behind 100_bugs.pl? This name clearly is copied from > src/test/subscription/t/100_bugs.pl - but I've never understood why that is > outside of the normal numbering space. > Yeah, I have also previously wondered about this name for src/test/subscription/t/100_bugs.pl. My guess is that it has been kept to distinguish it from the other feature tests which have numbering starting from 001. -- With Regards, Amit Kapila.
On 16.02.23 22:29, Andres Freund wrote: > What's the story behind 100_bugs.pl? This name clearly is copied from > src/test/subscription/t/100_bugs.pl - but I've never understood why that is > outside of the normal numbering space. Mainly to avoid awkwardness for backpatching. The number of tests in src/test/subscription/ varies quite a bit across branches.
On Fri, Feb 17, 2023 at 03:13:32PM +0100, Peter Eisentraut wrote: > On 16.02.23 22:29, Andres Freund wrote: >> What's the story behind 100_bugs.pl? This name clearly is copied from >> src/test/subscription/t/100_bugs.pl - but I've never understood why that is >> outside of the normal numbering space. > > Mainly to avoid awkwardness for backpatching. The number of tests in > src/test/subscription/ varies quite a bit across branches. I'm happy to move this new test to wherever folks think it should go. I'll look around to see if I can find a better place, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote: > I'm happy to move this new test to wherever folks think it should go. I'll > look around to see if I can find a better place, too. I think that src/test/recovery/ is the best fit, because this stresses a code path for WAL replay on pg_class for the template db. The name is not specific enough, though, why not just using something like 0NN_create_database.pl? -- Michael
Attachment
RE: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
From
"Ryo Matsumura (Fujitsu)"
Date:
> On Thu, Feb 16, 2023 at 10:24:13AM +0530, Dilip Kumar wrote: > > Yes, there is no reason to pass this as false, seems like this is > > passed false by mistake. And your patch fixes the issue. On Thu, Feb 16, 2023 at 02:26:55PM +0900, Michael Paquier wrote: > So, if I am understanding this stuff right, this issue can create data > corruption once a DDL updates any pages of pg_class stored in a > template database that gets copied by this routine. In this case, the > patch sent makes sure that any page copied will get written once a > checkpoint kicks in. Thank you for comment and patch. I think that the patch for dbcommand.c is fixed. So I apply to my environment. > I have not given much attention to this area, but I am a bit > suspicious that enforcing the default as WAL_LOG was a good idea for > 15~, TBH. We are usually much more conservative when it comes to > such choices, switching to the new behavior after a few years would > have been wiser.. I think so too. I was surprised that new strategy is default. Regards Ryo Matsumura
On Mon, Feb 20, 2023 at 05:02:01PM +0900, Michael Paquier wrote: > On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote: >> I'm happy to move this new test to wherever folks think it should go. I'll >> look around to see if I can find a better place, too. > > I think that src/test/recovery/ is the best fit, because this stresses > a code path for WAL replay on pg_class for the template db. The name > is not specific enough, though, why not just using something like > 0NN_create_database.pl? Okay. I've renamed the test file as suggested in v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Feb 20, 2023 at 04:43:22PM -0800, Nathan Bossart wrote: > On Mon, Feb 20, 2023 at 05:02:01PM +0900, Michael Paquier wrote: >> On Fri, Feb 17, 2023 at 02:35:30PM -0800, Nathan Bossart wrote: >>> I'm happy to move this new test to wherever folks think it should go. I'll >>> look around to see if I can find a better place, too. >> >> I think that src/test/recovery/ is the best fit, because this stresses >> a code path for WAL replay on pg_class for the template db. The name >> is not specific enough, though, why not just using something like >> 0NN_create_database.pl? > > Okay. I've renamed the test file as suggested in v3. The test enforces a checkpoint after the table creation on the template, so what about testing it also without a checkpoint, like the extended version attached? I have tweaked a few things in the test, while on it. -- Michael
Attachment
On Tue, Feb 21, 2023 at 03:13:10PM +0900, Michael Paquier wrote: > The test enforces a checkpoint after the table creation on the > template, so what about testing it also without a checkpoint, like the > extended version attached? I have tweaked a few things in the test, > while on it. What is the purpose of testing it without the checkpoint? Other than that question, the patch looks reasonable to me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Feb 21, 2023 at 10:00:11AM -0800, Nathan Bossart wrote: > What is the purpose of testing it without the checkpoint? Perhaps none, I was wondering whether it would be worth testing that with the flush phase, but perhaps that's just extra cycles wasted at this point. > Other than that > question, the patch looks reasonable to me. Okay, applied and backpatched with a minimal test set, then. I have kept the tweaks I did to the tests with extra comments. -- Michael
Attachment
On Wed, Feb 22, 2023 at 12:30:20PM +0900, Michael Paquier wrote: > Okay, applied and backpatched with a minimal test set, then. I have > kept the tweaks I did to the tests with extra comments. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com