Thread: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17731 Logged by: Egor Chindyaskin Email address: kyzevan23@mail.ru PostgreSQL version: 15.1 Operating system: Ubuntu 22.04 Description: Hello, Alexander Lakhin and I have found certain conditions and a sequence of actions under which the server does not start after it has been abnormally stopped. By running the following script on branch REL_15_STABLE, we get а non-starting server: for ((n=1;n<=10;n++)); do (for ((i=1;i<=1000;i++)); do printf "CREATE UNLOGGED TABLE unlogged$i(f1 text);"; done | psql > psql$n.log 2>&1) & done sleep 1 killall -9 postgres wait pg_ctl -D "$PGDB" stop pg_ctl -D "$PGDB" start And we get the following log messages: ... 2022-12-26 12:23:02.146 +07 [14540] LOG: redo done at 0/1A35880 system usage: CPU: user: 0.05 s, system: 0.02 s, elapsed: 0.08 s 2022-12-26 12:23:02.149 +07 [14540] FATAL: could not create file "base/16384/18790": File exists 2022-12-26 12:23:02.151 +07 [14537] LOG: startup process (PID 14540) exited with exit code 1 2022-12-26 12:23:02.151 +07 [14537] LOG: terminating any other active server processes 2022-12-26 12:23:02.151 +07 [14537] LOG: shutting down due to startup process failure 2022-12-26 12:23:02.153 +07 [14537] LOG: database system is shut down stopped waiting pg_ctl: could not start server
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Karina Litskevich
Date:
For unlogged tables and indexes init forks are created to simulate truncate on server startup. In StartupXLOG() every main fork, for which corresponding init fork exists, is deleted before replaying WAL, and then new main fork is created by copying init fork: ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP); ... PerformWalRecovery(); ... ResetUnloggedRelations(UNLOGGED_RELATION_INIT); So in case before WAL recovery main fork exists and init fork isn't, and during recovery init fork is created, we get this problem. The second ResetUnloggedRelations() call sees just created init fork and tries to create a main fork from it expecting that the old main fork was already deleted by the first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't corresponding init fork at that moment yet. If you try to start server again, it will start successfully, as this time both init and main forks will present from the beginning. The situation, when main fork of the unlogged table is present but init fork is not, can happen because main fork file isn't actually being deleted when table is dropped. It's deletion is being postponed until the next checkpoint. See comment before mdunlink(): /* ... * Leaving the empty file in place prevents that relfilenumber * from being reused. The scenario this protects us from is: * 1. We delete a relation (and commit, and actually remove its file). * 2. We create a new relation, which by chance gets the same relfilenumber as * the just-deleted one (OIDs must've wrapped around for that to happen). * 3. We crash before another checkpoint occurs. * ... */ Theoretically, this applies to all versions, but the script somehow doesn't lead to an error on REL_11_STABLE. I haven't investigated it yet. I see two solutions: 1) keep init fork files until the next checkpoint as well as main fork files, 2) ignore (rewrite if exists) presence of an empty main fork file when copying from init fork. I found the latter less elegant so I implemented the first one. The patch is attached. Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
Attachment
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 15:59:38 +0300, Karina Litskevich <litskevichkarina@gmail.com> wrote in > So in case before WAL recovery main fork exists and init fork isn't, and during > recovery init fork is created, we get this problem. The second > ResetUnloggedRelations() call sees just created init fork and tries to create a > main fork from it expecting that the old main fork was already deleted by the > first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't > corresponding init fork at that moment yet. Seems right. > Theoretically, this applies to all versions, but the script somehow doesn't lead > to an error on REL_11_STABLE. I haven't investigated it yet. > > I see two solutions: 1) keep init fork files until the next checkpoint as well > as main fork files, 2) ignore (rewrite if exists) presence of an empty main > fork file when copying from init fork. I found the latter less elegant so I > implemented the first one. The patch is attached. The init-fork related code has some other issues with crash-restart. A minor one is that the crash of the creating transaction for a unlogged relation leaves orphan init fork files. I haven't fully chased the specific issue rased here, but I think the common cause in the cases is that the file operations around unlogged files are not fully transactional. There is a proposed patchset [1], the first patch of which makes storage file creation and deletion transactional and crash-safe. As far as I see it seems to fix this case, too. The latest version of it that posed to this ML [2] needs a rebase and some fix for now, though. (I'll post a rebased version, soon.) As for the proposed patch, I haven't looked closely, but I don't think delaying init-file removal is the right approach. The reason of the delay, as mentioned, is someone might be accessing the file (causing deletion failure on some platforms). Init-fork files don't fall into that category. regards. [1] https://commitfest.postgresql.org/43/3461/ [2] https://www.postgresql.org/message-id/20230317.151634.1038632016265639446.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Michael Paquier
Date:
On Mon, Apr 24, 2023 at 03:59:38PM +0300, Karina Litskevich wrote: > For unlogged tables and indexes init forks are created to simulate truncate on > server startup. In StartupXLOG() every main fork, for which corresponding init > fork exists, is deleted before replaying WAL, and then new main fork is created > by copying init fork: > > ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP); > ... > PerformWalRecovery(); > ... > ResetUnloggedRelations(UNLOGGED_RELATION_INIT); > > So in case before WAL recovery main fork exists and init fork isn't, and during > recovery init fork is created, we get this problem. The second > ResetUnloggedRelations() call sees just created init fork and tries to create a > main fork from it expecting that the old main fork was already deleted by the > first ResetUnloggedRelations() call, but it wasn't because the main fork hasn't > corresponding init fork at that moment yet. > > If you try to start server again, it will start successfully, as this time both > init and main forks will present from the beginning. So, from what I read, what you basically mean is a sequence like that: 1) create unlogged table. 2) drop it. 3) Stop the server in immediate mode before the next checkpoint has the time to finish cleaning up the main fork still lying around. At this point the server has the truncated main fork, but not the init fork as it has already been removed. 4) Restart server, recovery begins. 5) ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) happens, sees only what looks like a main fork, thinks there is nothing to do because there is no init fork. 6) Begin WAL redo, 7) Replay the record that created the init fork. 8) Finish recovery. 9) ResetUnloggedRelations(UNLOGGED_RELATION_INIT) sees both the init fork and the main fork. We would do a copy_dir() from the init file to the main fork, that fails on EEXIST. Between points 7 and 8, there is something I am not really following, though. The deletion of all the forks of an unlogged table should be replayed as well until we reach consistency, no? At redo, the cleanup of the forks is done when the COMMIT record of the transaction that dropped the table is replayed, rather than delayed at checkpoint as a sync request. Hence, the init fork previously created should not exist to begin with, no? Am I missing something? -- Michael
Attachment
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Karina Litskevich
Date:
пн, 1 мая 2023 г. в 10:33, Michael Paquier <michael@paquier.xyz>: > So, from what I read, what you basically mean is a sequence like that: > 1) create unlogged table. > 2) drop it. If we talk about the script provided by Egor, the table is being deleted, because the transaction that tried to create it aborts. > 3) Stop the server in immediate mode before the next checkpoint has > the time to finish cleaning up the main fork still lying around. At > this point the server has the truncated main fork, but not the init > fork as it has already been removed. > 4) Restart server, recovery begins. > 5) ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) happens, sees > only what looks like a main fork, thinks there is nothing to do > because there is no init fork. > 6) Begin WAL redo, > 7) Replay the record that created the init fork. > 8) Finish recovery. > 9) ResetUnloggedRelations(UNLOGGED_RELATION_INIT) sees both the init > fork and the main fork. We would do a copy_dir() from the init file > to the main fork, that fails on EEXIST. That's right. > Between points 7 and 8, there is something I am not really following, > though. The deletion of all the forks of an unlogged table should be > replayed as well until we reach consistency, no? At redo, the cleanup > of the forks is done when the COMMIT record of the transaction that > dropped the table is replayed, rather than delayed at checkpoint as a > sync request. Hence, the init fork previously created should not > exist to begin with, no? Am I missing something? Recovery replays all the WAL records until the last one present. If the server was stopped immediately, there could be some records, that represent operations of the transaction, which commit or abort record wasn't flushed at the moment of the stop. As I understand, recovery works well for operations on tuples, since in their headers transaction IDs are stored and it could be figured out whether given transaction was committed or not. So when we replay for example insert, tuples are inserted, but they won't be visible, if we don't replay commit. But the same doesn't work for operations on files. If a file was created during replay, it's still just a file. In our case, we expect it to be deleted by abort, but abort record wasn't flushed, so the file is present. As I see now, the problem of operations on files is more than just about this particular case. So my fix is not the right solution, and we should rather consider Kyotaro's patch. Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Karina Litskevich
Date:
вт, 25 апр. 2023 г. в 06:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com>: > The init-fork related code has some other issues with crash-restart. A > minor one is that the crash of the creating transaction for a unlogged > relation leaves orphan init fork files. I haven't fully chased the > specific issue rased here, but I think the common cause in the cases > is that the file operations around unlogged files are not fully > transactional. There is a proposed patchset [1], the first patch of > which makes storage file creation and deletion transactional and > crash-safe. As far as I see it seems to fix this case, too. I tried v28 of your patch and it fixes the problem. It seems to me that you are right, there are other issues with file creation and deletion, and your patch is a more proper way to fix them. Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
Re: BUG #17731: Server doesn't start after abnormal shutdown while creating unlogged tables
From
Michael Paquier
Date:
On Mon, May 15, 2023 at 03:03:48PM +0300, Karina Litskevich wrote: > пн, 1 мая 2023 г. в 10:33, Michael Paquier <michael@paquier.xyz>: > Recovery replays all the WAL records until the last one present. If the server > was stopped immediately, there could be some records, that represent operations > of the transaction, which commit or abort record wasn't flushed at the moment > of the stop. Yes, that's kind of the key point I was missing when looking first at this thread: under a DDL that's run in a self-transaction, if we have the files but not the records able to delete them, then boom. > As I see now, the problem of operations on files is more than just about this > particular case. So my fix is not the right solution, and we should rather > consider Kyotaro's patch. Yes, right. Orphan files are a much larger problem, and the unlogged case is one symptom behind a much larger thing, in look for a solution that requires, as far as I can see from the discussions, something much more invasive than what would be acceptable for a backpatch. -- Michael