Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date
Msg-id CA+TgmobZufK6PcCRpvjP6dJqOBO-woWZF0cqMHWGvzkbgeaBKw@mail.gmail.com
Whole thread Raw
In response to Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
List pgsql-hackers
On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Unfortunately, that isn't very consistent, and you have have to run it a bunch
> of times...

I was eventually able to reproduce this in part by using the
interactive psql method you describe. It didn't crash, but it did spit
out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET
statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log ; \c a \c postgres
SET
ERROR:  database "a" does not exist
SET
ERROR:  canceling statement due to statement timeout
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:
database "a" does not exist
Previous connection kept
postgres=# select * from pg_database;
 oid |  datname  | datdba | encoding | datlocprovider | datistemplate
| datallowconn | datconnlimit | datfrozenxid | datminmxid |
dattablespace | datcollate  |  datctype   | daticulocale |
datcollversion |           datacl

-----+-----------+--------+----------+----------------+---------------+--------------+--------------+--------------+------------+---------------+-------------+-------------+--------------+----------------+----------------------------
   5 | postgres  |     10 |        6 | c              | f
| t            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
   1 | template1 |     10 |        6 | c              | t
| t            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
{=c/rhaas,rhaas=CTc/rhaas}
   4 | template0 |     10 |        6 | c              | t
| f            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
{=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it
again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING:  problem in alloc set
PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2   postgres                            0x000000010cd37ef5 AllocSetCheck + 549
3   postgres                            0x000000010cd37730 AllocSetReset + 48
4   postgres                            0x000000010cd3f6f1
MemoryContextResetOnly + 81
5   postgres                            0x000000010cd378b9 AllocSetDelete + 73
6   postgres                            0x000000010cd41e09 PortalDrop + 425
7   postgres                            0x000000010cd427bb
AtCleanup_Portals + 203
8   postgres                            0x000000010c86476d
CleanupTransaction + 29
9   postgres                            0x000000010c865d4f
AbortCurrentTransaction + 63
10  postgres                            0x000000010cba1395 PostgresMain + 885
11  postgres                            0x000000010caf5472 PostmasterMain + 7586
12  postgres                            0x000000010ca31e3d main + 2205
13  libdyld.dylib                       0x00007fff699afcc9 start + 1
14  ???                                 0x0000000000000001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE:
bit to go into an infinite loop if it's hit, which enabled me to hook
up a debugger at the point of the failure. The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000010e98a157
postgres`send_message_to_server_log(edata=0x000000010ec0f658) at
elog.c:2916:4
    frame #1: 0x000000010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
    frame #2: 0x000000010e986016 postgres`errfinish(filename="aset.c",
lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
    frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
    frame #4: 0x000000010e9c7c05
postgres`AllocSetDelete(context=0x00007ff77c80d200) at aset.c:638:2
    frame #5: 0x000000010e9d368b
postgres`MemoryContextDelete(context=0x00007ff77c80d200) at
mcxt.c:252:2
  * frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
    frame #7: 0x000000010e9d7e0e postgres`AtCleanup_Portals at portalmem.c:907:3
    frame #8: 0x000000010e22030d postgres`CleanupTransaction at xact.c:2890:2
    frame #9: 0x000000010e2219da postgres`AbortCurrentTransaction at
xact.c:3328:4
    frame #10: 0x000000010e763237
postgres`PostgresMain(dbname="postgres", username="rhaas") at
postgres.c:4232:3
    frame #11: 0x000000010e6625aa
postgres`BackendRun(port=0x00007ff77c1042c0) at postmaster.c:4490:2
    frame #12: 0x000000010e661b18
postgres`BackendStartup(port=0x00007ff77c1042c0) at
postmaster.c:4218:3
    frame #13: 0x000000010e66088a postgres`ServerLoop at postmaster.c:1808:7
    frame #14: 0x000000010e65def2 postgres`PostmasterMain(argc=1,
argv=0x00007ff77ae05cf0) at postmaster.c:1480:11
    frame #15: 0x000000010e50521f postgres`main(argc=1,
argv=0x00007ff77ae05cf0) at main.c:197:3
    frame #16: 0x00007fff699afcc9 libdyld.dylib`start + 1
(lldb) fr sel 6
frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
   593 MemoryContextDelete(portal->holdContext);
   594
   595 /* release subsidiary storage */
-> 596 MemoryContextDelete(portal->portalContext);
   597
   598 /* release portal struct (it's in TopPortalContext) */
   599 pfree(portal);
(lldb) fr sel 3
frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
   1466 * Check chunk size
   1467 */
   1468 if (dsize > chsize)
-> 1469 elog(WARNING, "problem in alloc set %s: req size > alloc size
for chunk %p in block %p",
   1470 name, chunk, block);
   1471 if (chsize < (1 << ALLOC_MINBITS))
   1472 elog(WARNING, "problem in alloc set %s: bad size %zu for chunk
%p in block %p",
(lldb) p dsize
(Size) $3 = 20
(lldb) p chsize
(Size) $4 = 0

It seems like CreateDatabaseUsingWalLog() must be doing something that
corrupts PortalContext, but at the moment I'm not sure what that thing
could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: jsonpath syntax extensions
Next
From: Jacob Champion
Date:
Subject: Re: [PATCH] minor reloption regression tests improvement