Thread: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables

BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17731
Logged by:          Egor Chindyaskin
Email address:      kyzevan23@mail.ru
PostgreSQL version: 15.1
Operating system:   Ubuntu 22.04
Description:

Hello, Alexander Lakhin and I have found certain conditions and a sequence
of actions under which the server does not start after it has been
abnormally stopped.
By running the following script on branch REL_15_STABLE, we get а
non-starting server:

for ((n=1;n<=10;n++)); do
(for ((i=1;i<=1000;i++)); do printf "CREATE UNLOGGED TABLE unlogged$i(f1
text);"; done | psql > psql$n.log 2>&1) &
done
sleep 1
killall -9 postgres
wait

pg_ctl -D "$PGDB" stop
pg_ctl -D "$PGDB" start

And we get the following log messages:

...
2022-12-26 12:23:02.146 +07 [14540] LOG:  redo done at 0/1A35880 system
usage: CPU: user: 0.05 s, system: 0.02 s, elapsed: 0.08 s
2022-12-26 12:23:02.149 +07 [14540] FATAL:  could not create file
"base/16384/18790": File exists
2022-12-26 12:23:02.151 +07 [14537] LOG:  startup process (PID 14540) exited
with exit code 1
2022-12-26 12:23:02.151 +07 [14537] LOG:  terminating any other active
server processes
2022-12-26 12:23:02.151 +07 [14537] LOG:  shutting down due to startup
process failure
2022-12-26 12:23:02.153 +07 [14537] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server


For unlogged tables and indexes init forks are created to simulate truncate on
server startup. In StartupXLOG() every main fork, for which corresponding init
fork exists, is deleted before replaying WAL, and then new main fork is created
by copying init fork:

ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
...
PerformWalRecovery();
...
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);

So in case before WAL recovery main fork exists and init fork isn't, and during
recovery init fork is created, we get this problem. The second
ResetUnloggedRelations() call sees just created init fork and tries to create a
main fork from it expecting that the old main fork was already deleted by the
first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't
corresponding init fork at that moment yet.

If you try to start server again, it will start successfully, as this time both
init and main forks will present from the beginning.

The situation, when main fork of the unlogged table is present but init fork is
not, can happen because main fork file isn't actually being deleted when table
is dropped. It's deletion is being postponed until the next checkpoint. See
comment before mdunlink():

/* ...
 *           Leaving the empty file in place prevents that relfilenumber
 * from being reused.  The scenario this protects us from is:
 * 1. We delete a relation (and commit, and actually remove its file).
 * 2. We create a new relation, which by chance gets the same relfilenumber as
 *   the just-deleted one (OIDs must've wrapped around for that to happen).
 * 3. We crash before another checkpoint occurs.
 * ...
 */

Theoretically, this applies to all versions, but the script somehow doesn't lead
to an error on REL_11_STABLE. I haven't investigated it yet.

I see two solutions: 1) keep init fork files until the next checkpoint as well
as main fork files, 2) ignore (rewrite if exists) presence of an empty main
fork file when copying from init fork. I found the latter less elegant so I
implemented the first one. The patch is attached.


Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/

Attachment
At Mon, 24 Apr 2023 15:59:38 +0300, Karina Litskevich <litskevichkarina@gmail.com> wrote in 
> So in case before WAL recovery main fork exists and init fork isn't, and during
> recovery init fork is created, we get this problem. The second
> ResetUnloggedRelations() call sees just created init fork and tries to create a
> main fork from it expecting that the old main fork was already deleted by the
> first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't
> corresponding init fork at that moment yet.

Seems right.

> Theoretically, this applies to all versions, but the script somehow doesn't lead
> to an error on REL_11_STABLE. I haven't investigated it yet.
> 
> I see two solutions: 1) keep init fork files until the next checkpoint as well
> as main fork files, 2) ignore (rewrite if exists) presence of an empty main
> fork file when copying from init fork. I found the latter less elegant so I
> implemented the first one. The patch is attached.

The init-fork related code has some other issues with crash-restart. A
minor one is that the crash of the creating transaction for a unlogged
relation leaves orphan init fork files.  I haven't fully chased the
specific issue rased here, but I think the common cause in the cases
is that the file operations around unlogged files are not fully
transactional.  There is a proposed patchset [1], the first patch of
which makes storage file creation and deletion transactional and
crash-safe.  As far as I see it seems to fix this case, too.

The latest version of it that posed to this ML [2] needs a rebase and
some fix for now, though.  (I'll post a rebased version, soon.)

As for the proposed patch, I haven't looked closely, but I don't think
delaying init-file removal is the right approach. The reason of the
delay, as mentioned, is someone might be accessing the file (causing
deletion failure on some platforms). Init-fork files don't fall into
that category.

regards.

[1] https://commitfest.postgresql.org/43/3461/

[2] https://www.postgresql.org/message-id/20230317.151634.1038632016265639446.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Mon, Apr 24, 2023 at 03:59:38PM +0300, Karina Litskevich wrote:
> For unlogged tables and indexes init forks are created to simulate truncate on
> server startup. In StartupXLOG() every main fork, for which corresponding init
> fork exists, is deleted before replaying WAL, and then new main fork is created
> by copying init fork:
>
> ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
> ...
> PerformWalRecovery();
> ...
> ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
>
> So in case before WAL recovery main fork exists and init fork isn't, and during
> recovery init fork is created, we get this problem. The second
> ResetUnloggedRelations() call sees just created init fork and tries to create a
> main fork from it expecting that the old main fork was already deleted by the
> first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't
> corresponding init fork at that moment yet.
>
> If you try to start server again, it will start successfully, as this time both
> init and main forks will present from the beginning.

So, from what I read, what you basically mean is a sequence like that:
1) create unlogged table.
2) drop it.
3) Stop the server in immediate mode before the next checkpoint has
the time to finish cleaning up the main fork still lying around.  At
this point the server has the truncated main fork, but not the init
fork as it has already been removed.
4) Restart server, recovery begins.
5) ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) happens, sees
only what looks like a main fork, thinks there is nothing to do
because there is no init fork.
6) Begin WAL redo,
7) Replay the record that created the init fork.
8) Finish recovery.
9) ResetUnloggedRelations(UNLOGGED_RELATION_INIT) sees both the init
fork and the main fork.  We would do a copy_dir() from the init file
to the main fork, that fails on EEXIST.

Between points 7 and 8, there is something I am not really following,
though.  The deletion of all the forks of an unlogged table should be
replayed as well until we reach consistency, no?  At redo, the cleanup
of the forks is done when the COMMIT record of the transaction that
dropped the table is replayed, rather than delayed at checkpoint as a
sync request.  Hence, the init fork previously created should not
exist to begin with, no?  Am I missing something?
--
Michael

Attachment
пн, 1 мая 2023 г. в 10:33, Michael Paquier <michael@paquier.xyz>:
> So, from what I read, what you basically mean is a sequence like that:
> 1) create unlogged table.
> 2) drop it.

If we talk about the script provided by Egor, the table is being deleted,
because the transaction that tried to create it aborts.

> 3) Stop the server in immediate mode before the next checkpoint has
> the time to finish cleaning up the main fork still lying around.  At
> this point the server has the truncated main fork, but not the init
> fork as it has already been removed.
> 4) Restart server, recovery begins.
> 5) ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) happens, sees
> only what looks like a main fork, thinks there is nothing to do
> because there is no init fork.
> 6) Begin WAL redo,
> 7) Replay the record that created the init fork.
> 8) Finish recovery.
> 9) ResetUnloggedRelations(UNLOGGED_RELATION_INIT) sees both the init
> fork and the main fork.  We would do a copy_dir() from the init file
> to the main fork, that fails on EEXIST.

That's right.

> Between points 7 and 8, there is something I am not really following,
> though.  The deletion of all the forks of an unlogged table should be
> replayed as well until we reach consistency, no?  At redo, the cleanup
> of the forks is done when the COMMIT record of the transaction that
> dropped the table is replayed, rather than delayed at checkpoint as a
> sync request.  Hence, the init fork previously created should not
> exist to begin with, no?  Am I missing something?

Recovery replays all the WAL records until the last one present. If the server
was stopped immediately, there could be some records, that represent operations
of the transaction, which commit or abort record wasn't flushed at the moment
of the stop. As I understand, recovery works well for operations on tuples,
since in their headers transaction IDs are stored and it could be figured out
whether given transaction was committed or not. So when we replay for example
insert, tuples are inserted, but they won't be visible, if we don't replay
commit. But the same doesn't work for operations on files. If a file was
created during replay, it's still just a file. In our case, we expect it to
be deleted by abort, but abort record wasn't flushed, so the file is present.

As I see now, the problem of operations on files is more than just about this
particular case. So my fix is not the right solution, and we should rather
consider Kyotaro's patch.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/



вт, 25 апр. 2023 г. в 06:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com>:
> The init-fork related code has some other issues with crash-restart. A
> minor one is that the crash of the creating transaction for a unlogged
> relation leaves orphan init fork files.  I haven't fully chased the
> specific issue rased here, but I think the common cause in the cases
> is that the file operations around unlogged files are not fully
> transactional.  There is a proposed patchset [1], the first patch of
> which makes storage file creation and deletion transactional and
> crash-safe.  As far as I see it seems to fix this case, too.

I tried v28 of your patch and it fixes the problem. It seems to me that you are
right, there are other issues with file creation and deletion, and your patch
is a more proper way to fix them.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/



On Mon, May 15, 2023 at 03:03:48PM +0300, Karina Litskevich wrote:
> пн, 1 мая 2023 г. в 10:33, Michael Paquier <michael@paquier.xyz>:
> Recovery replays all the WAL records until the last one present. If the server
> was stopped immediately, there could be some records, that represent operations
> of the transaction, which commit or abort record wasn't flushed at the moment
> of the stop.

Yes, that's kind of the key point I was missing when looking first at
this thread: under a DDL that's run in a self-transaction, if we have
the files but not the records able to delete them, then boom.

> As I see now, the problem of operations on files is more than just about this
> particular case. So my fix is not the right solution, and we should rather
> consider Kyotaro's patch.

Yes, right.  Orphan files are a much larger problem, and the unlogged
case is one symptom behind a much larger thing, in look for a solution
that requires, as far as I can see from the discussions, something
much more invasive than what would be acceptable for a backpatch.
--
Michael

Attachment