xlog checkpoint depends on sync() ... seems unsafe - Mailing list pgsql-hackers

From Tom Lane
Subject xlog checkpoint depends on sync() ... seems unsafe
Date
Msg-id 17863.984456661@sss.pgh.pa.us
List pgsql-hackers
I wrote a couple of days ago:

: BTW, can we really trust checkpoint to mean that all data file changes
: are down on disk?  I see that the actual implementation of checkpoint is
: 
:     write out all dirty shmem buffers;
:     sync();
:     if (IsUnderPostmaster)
:         sleep(2);
:     sync();
:     write checkpoint record to XLOG;
:     fsync XLOG;
: 
: Now HP's man page for sync() says
: 
:      The writing, although scheduled, is not necessarily complete upon
:      return from sync.

The more I think about this, the more disturbed I get.  It seems clear
that this sequence is capable of writing out the checkpoint record
before all dirty data pages have reached disk.  If we suffer a crash
before the data pages do reach disk, then on restart we will not realize
we need to redo the changes to those pages.  This seems an awfully large
hole for what is claimed to be a bulletproof xlog technology.

I feel that checkpoint should not use sync(2) at all, but should instead
depend on fsync'ing the data files --- since fsync doesn't return until
the write is done, this is considerably more secure.  (Of course disk
drive write reordering could still mess you up, but at least
kernel-level failures won't put your data at risk.)
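
To illustrate the difference, here is a minimal sketch of a write that returns only after fsync() reports the data flushed. The helper name write_durably is hypothetical and is not taken from the PostgreSQL sources; it only shows the guarantee that sync() does not promise.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write a buffer to a file and
 * return only after the kernel reports the file's data as flushed.
 * Returns 0 on success, -1 on any failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0)
    {
        ssize_t n = write(fd, p, len);
        if (n <= 0)             /* treat short-circuit and errors alike */
        {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /* sync() may return with writes merely scheduled; fsync() blocks
     * until the flush of this file has completed or failed. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A checkpoint built this way would write its XLOG record only after every such call had returned 0.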

One way to do this would be to maintain a hashtable in shared memory
of data files that have been written to since the last checkpoint.
We'd need to set a limit on the size of the hashtable (say a few hundred
entries) --- if it overflows, remove the oldest entry and fsync that
file before forgetting it.  However that seems moderately complex,
and probably too risky to do just before release.  Spinlock contention
on the hashtable could be a problem too.
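
A rough sketch of that bookkeeping follows, under loud assumptions: it uses a small fixed-size FIFO array with linear search instead of a real hashtable, it lives in ordinary process memory rather than shared memory, and it does no locking at all. The names remember_dirty and checkpoint_fsync_all are made up for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_DIRTY 4     /* illustration only; the text suggests a few hundred */
#define PATH_LEN  256

/* Circular FIFO of files written since the last checkpoint (assumed layout). */
static char dirty_files[MAX_DIRTY][PATH_LEN];
static int  oldest = 0;     /* index of the oldest entry */
static int  ndirty = 0;     /* number of live entries */

static int fsync_path(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Record that a file has been written. On overflow, fsync and forget the
 * oldest entry first. Returns 0 on success, -1 on failure. */
int remember_dirty(const char *path)
{
    if (strlen(path) >= PATH_LEN)
        return -1;

    for (int i = 0; i < ndirty; i++)
        if (strcmp(dirty_files[(oldest + i) % MAX_DIRTY], path) == 0)
            return 0;           /* already tracked */

    if (ndirty == MAX_DIRTY)
    {
        if (fsync_path(dirty_files[oldest]) != 0)
            return -1;
        oldest = (oldest + 1) % MAX_DIRTY;
        ndirty--;
    }

    strcpy(dirty_files[(oldest + ndirty) % MAX_DIRTY], path);
    ndirty++;
    return 0;
}

/* At checkpoint: fsync every tracked file, then empty the table.
 * Returns 0 if all fsyncs succeeded, -1 otherwise. */
int checkpoint_fsync_all(void)
{
    int rc = 0;
    while (ndirty > 0)
    {
        if (fsync_path(dirty_files[oldest]) != 0)
            rc = -1;
        oldest = (oldest + 1) % MAX_DIRTY;
        ndirty--;
    }
    return rc;
}
```

A shared-memory version would have to take a lock around both functions, which is where the contention worry comes from.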

I thought about having checkpoint physically scan the $PGDATA/base/*
directories and fsync every file found in them, but that seems mighty
slow and ugly.
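
For comparison, the brute-force scan could look roughly like the sketch below. It is only an illustration: fsync_tree is a hypothetical name, the directory passed in stands for wherever the data files live, and a real implementation would need to decide how to handle files that vanish mid-scan.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync every regular file under dir, recursing into
 * subdirectories. Returns the number of files synced, or -1 on the first
 * error encountered.
 */
long fsync_tree(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    long synced = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        int n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (n < 0 || (size_t) n >= sizeof(path)) { synced = -1; break; }

        struct stat st;
        if (lstat(path, &st) != 0) { synced = -1; break; }

        if (S_ISDIR(st.st_mode))
        {
            long sub = fsync_tree(path);
            if (sub < 0) { synced = -1; break; }
            synced += sub;
        }
        else if (S_ISREG(st.st_mode))
        {
            /* One open/fsync/close per file, whether or not it was
             * modified: this is the cost that makes the scan slow. */
            int fd = open(path, O_WRONLY);
            if (fd < 0) { synced = -1; break; }
            int rc = fsync(fd);
            close(fd);
            if (rc != 0) { synced = -1; break; }
            synced++;
        }
    }
    closedir(d);
    return synced;
}
```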

Is there another way?

        regards, tom lane

