Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date
Msg-id 20220802175043.GA13682@telsasoft.com
Whole thread Raw
In response to Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Mar 29, 2022 at 11:55:05AM -0400, Robert Haas wrote:
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > smgrcreate() as we would for most WAL records or whether it should be
> > adopting the new system introduced by
> > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> > over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
> 
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...

There's a crash

2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not open critical system index 2662

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007efe27999801 in __GI_abort () at abort.c:79
#2  0x00005583891941dc in errfinish (filename=<optimized out>, filename@entry=0x558389420437 "relcache.c",
lineno=lineno@entry=4328,
    funcname=funcname@entry=0x558389421680 <__func__.33178> "load_critical_index") at elog.c:675
#3  0x00005583891713ef in load_critical_index (indexoid=indexoid@entry=2662, heapoid=heapoid@entry=1259) at
relcache.c:4328
#4  0x0000558389172667 in RelationCacheInitializePhase3 () at relcache.c:4103
#5  0x00005583891b93a4 in InitPostgres (in_dbname=in_dbname@entry=0x55838a50d468 "a", dboid=dboid@entry=0,
username=username@entry=0x55838a50d448"pryzbyj", useroid=useroid@entry=0,
 
    load_session_libraries=<optimized out>, override_allow_connections=override_allow_connections@entry=false,
out_dbname=0x0)at postinit.c:1087
 
#6  0x0000558388daa7bb in PostgresMain (dbname=0x55838a50d468 "a", username=username@entry=0x55838a50d448 "pryzbyj") at
postgres.c:4081
#7  0x0000558388b9f423 in BackendRun (port=port@entry=0x55838a505dd0) at postmaster.c:4490
#8  0x0000558388ba6e07 in BackendStartup (port=port@entry=0x55838a505dd0) at postmaster.c:4218
#9  0x0000558388ba747f in ServerLoop () at postmaster.c:1808
#10 0x0000558388ba8f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#11 0x0000558388840e1f in main (argc=7, argv=0x55838a4dc000) at main.c:197

while :; do psql -qh /tmp postgres -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log";
done
# Run this for a few loops and then ^C or hold down ^C until it stops,
# and then connect to postgres and try to connect to 'a':
postgres=# \c a
2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not open critical system index 2662

Unfortunately, that isn't very consistent, and you have have to run it a bunch
of times...

I don't know if it's an issue of any significance that CREATE DATABASE / ^C
leaves behind a broken database, but it is an issue that the cluster crashes.

While struggling to reproduce that problem, I also hit this warning, which may
or may not be the same.  I added an abort() after WARNING in aset.c to get a
backtrace.

WARNING:  problem in alloc set PortalContext: bogus aset link in block 0x55a63f2f9d60, chunk 0x55a63f2fb138

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No existe el archivo o el directorio.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f81144f1801 in __GI_abort () at abort.c:79
#2  0x000055a63c834c5d in AllocSetCheck (context=context@entry=0x55a63f26fea0) at aset.c:1491
#3  0x000055a63c835b09 in AllocSetDelete (context=0x55a63f26fea0) at aset.c:638
#4  0x000055a63c854322 in MemoryContextDelete (context=0x55a63f26fea0) at mcxt.c:252
#5  0x000055a63c8591d5 in PortalDrop (portal=portal@entry=0x55a63f2bb7a0, isTopCommit=isTopCommit@entry=false) at
portalmem.c:596
#6  0x000055a63c3e4a7b in exec_simple_query (query_string=query_string@entry=0x55a63f24db90 "CREATE DATABASE a TEMPLATE
postgresSTRATEGY wal_log ;") at postgres.c:1253
 
#7  0x000055a63c3e7fc1 in PostgresMain (dbname=<optimized out>, username=username@entry=0x55a63f279448 "pryzbyj") at
postgres.c:4505
#8  0x000055a63c1dc423 in BackendRun (port=port@entry=0x55a63f271dd0) at postmaster.c:4490
#9  0x000055a63c1e3e07 in BackendStartup (port=port@entry=0x55a63f271dd0) at postmaster.c:4218
#10 0x000055a63c1e447f in ServerLoop () at postmaster.c:1808
#11 0x000055a63c1e5f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#12 0x000055a63be7de1f in main (argc=7, argv=0x55a63f248000) at main.c:197

I reproduced that by running this a couple dozen times in an interactive psql.
It doesn't seem to affect STRATEGY=file_copy.

SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log; \c a \c postgres
 

Also, if I understand correctly, this patch seems to assume that nobody is
connected to the source database.  But what's actually enforced is just that
nobody *else* is connected.  Is it any issue that the current DB can be used as
a source?  Anyway, both of the above problems are reproducible using a
different database.

|postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
|CREATE DATABASE

-- 
Justin



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: pg_auth_members.grantor is bunk
Next
From: Tom Lane
Date:
Subject: Re: Cutting test runtime for src/test/modules/snapshot_too_old