Unlogged relation copy is not fsync'd - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Unlogged relation copy is not fsync'd
Date
Msg-id 65e94fc8-ce1d-dd02-3be3-fda0fe8f2965@iki.fi
Whole thread Raw
Responses Re: Unlogged relation copy is not fsync'd
Re: Unlogged relation copy is not fsync'd
Re: Unlogged relation copy is not fsync'd
List pgsql-hackers
I noticed another missing fsync() with unlogged tables, similar to the 
one at [1].

RelationCopyStorage does this:

>     /*
>      * When we WAL-logged rel pages, we must nonetheless fsync them.  The
>      * reason is that since we're copying outside shared buffers, a CHECKPOINT
>      * occurring during the copy has no way to flush the previously written
>      * data to disk (indeed it won't know the new rel even exists).  A crash
>      * later on would replay WAL from the checkpoint, therefore it wouldn't
>      * replay our earlier WAL entries. If we do not fsync those pages here,
>      * they might still not be on disk when the crash occurs.
>      */
>     if (use_wal ||  copying_initfork)
>         smgrimmedsync(dst, forkNum);

That 'copying_initfork' condition is wrong. The first hint of that is 
that 'use_wal' is always true for an init fork. I believe this was meant 
to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise, 
this bad thing can happen:

1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls 
RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power

When you recover, the unlogged table is not reset, because it was a 
clean postgres shutdown. But the part of the file that was copied after 
the checkpoint at step 3 was never fsync'd, so part of the file is lost. 
I was able to reproduce with a VM that I kill to simulate step 5.

This is the same scenario that the smgrimmedsync() call above protects 
from for WAL-logged relations. But we got the condition wrong: instead 
of worrying about the init fork, we need to call smgrimmedsync() for all 
*other* forks of an unlogged relation.

Fortunately the fix is trivial, see attached. I suspect we have similar 
problems in a few other places, though. end_heap_rewrite(), _bt_load(), 
and gist_indexsortbuild have a similar-looking smgrimmedsync() calls, 
with no consideration for unlogged relations at all. I haven't tested 
those, but they look wrong to me.

I'm also attaching the scripts I used to reproduce this, although they 
will require some manual fiddling to run.

[1] 
https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: [PATCH] Add XMLText function (SQL/XML X038)
Next
From: Peter Eisentraut
Date:
Subject: Format list of catalog files in makefile vertically