I noticed another missing fsync() with unlogged tables, similar to the
one at [1].
RelationCopyStorage does this:
> /*
> * When we WAL-logged rel pages, we must nonetheless fsync them. The
> * reason is that since we're copying outside shared buffers, a CHECKPOINT
> * occurring during the copy has no way to flush the previously written
> * data to disk (indeed it won't know the new rel even exists). A crash
> * later on would replay WAL from the checkpoint, therefore it wouldn't
> * replay our earlier WAL entries. If we do not fsync those pages here,
> * they might still not be on disk when the crash occurs.
> */
> if (use_wal || copying_initfork)
> smgrimmedsync(dst, forkNum);
That 'copying_initfork' condition is wrong. The first hint of that is
that 'use_wal' is always true for an init fork. I believe this was meant
to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise,
this bad thing can happen:
1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power
When you recover, the unlogged table is not reset, because it was a
clean postgres shutdown. But the part of the file that was copied after
the checkpoint at step 3 was never fsync'd, so part of the file is lost.
I was able to reproduce with a VM that I kill to simulate step 5.
This is the same scenario that the smgrimmedsync() call above protects
from for WAL-logged relations. But we got the condition wrong: instead
of worrying about the init fork, we need to call smgrimmedsync() for all
*other* forks of an unlogged relation.
Fortunately the fix is trivial, see attached. I suspect we have similar
problems in a few other places, though. end_heap_rewrite(), _bt_load(),
and gist_indexsortbuild have a similar-looking smgrimmedsync() calls,
with no consideration for unlogged relations at all. I haven't tested
those, but they look wrong to me.
I'm also attaching the scripts I used to reproduce this, although they
will require some manual fiddling to run.
[1]
https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi
--
Heikki Linnakangas
Neon (https://neon.tech)