Jeff Janes wrote:
> 12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
> 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
>
> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.
I think I see the problem here now, after letting this test rig run for
a while.
First, the fact that there are holes in the members/ files because of
the random order of deletion seems harmless in itself, because the
files remaining in between will be deleted by a future vacuum.
Now, the real problem is that we delete files during vacuum, but the
state that marks those files as safely deletable is written as part of
a checkpoint record, not by vacuum itself (vacuum writes its state in
pg_database, but a checkpoint derives its info from a shared memory
variable). Taken together, this means that if there's a crash between
the vacuum that does a deletion and the next checkpoint, we might
attempt to read an offset file that is not supposed to be part of the
live range -- but we have forgotten that, because we never reached the
point where the shmem state is saved to disk.
It seems to me that we need to keep the offsets files around until a
checkpoint has written the "oldest" number to WAL. In other words, we
need additional state in shared memory: (a) what we currently store,
which is the oldest number as computed by vacuum (not yet safe to
delete up to, but the number that the next checkpoint must write), and
(b) the oldest number that the last checkpoint wrote (the safe
deletion point).
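To make that concrete, here is a rough sketch of the shared memory
state I have in mind -- the names are only illustrative, not
necessarily the real multixact.c identifiers:

typedef struct MultiXactStateData
{
    /* ... existing fields ... */

    /*
     * (a) Oldest multixact as computed by the latest vacuum.  This is
     * the value the next checkpoint must include in its record, but
     * it is NOT yet safe to delete files up to here.
     */
    MultiXactId oldestMultiXactId;

    /*
     * (b) Oldest multixact written by the last completed checkpoint.
     * Files covering only multixacts older than this are safe to
     * delete.
     */
    MultiXactId checkpointedOldestMulti;
} MultiXactStateData;

Vacuum would then truncate only up to (b), and the checkpointer would
advance (b) to match (a) once the checkpoint record containing (a) is
safely on disk.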
Another thing I noticed is that more than one vacuum process can try
to run deletion simultaneously, at least if they crash frequently
while doing deletion. I don't see that this is troublesome, even
though they might attempt to delete the same files.
Finally, I noticed that we first read the oldest offset file, then
determine the member files to delete; then we delete offset files, and
only then member files. This means that we could delete offset files
that correspond to member files we keep (assuming there is a crash in
the middle of deletion). It seems to me we should first delete
members, then offsets, if we want to be very safe about it. I don't
think this would really matter much, though, if we do things safely as
described above.
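In pseudocode, the reordering I'm suggesting (function names made up
just to show the idea):

    /* current order */
    DeleteOffsetSegments(oldestOffsetSeg);
    /* a crash here leaves member files whose offset files are
     * already gone */
    DeleteMemberSegments(oldestMemberSeg);

    /* proposed order */
    DeleteMemberSegments(oldestMemberSeg);
    /* a crash here keeps the offset files, so a later vacuum can
     * still read them and finish the job */
    DeleteOffsetSegments(oldestOffsetSeg);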
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services