Thread: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
From: Jeff Janes
(Redirected from Bugs to Hackers, as it seems not to be the same bug as the original one.)

On Mon, Jun 9, 2014 at 3:50 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>
>> This patch seems to solve a problem I've also been having with non-existent
>> "pg_multixact/members/13D35" files in my testing of torn page writes and fk
>> locks against recent 9.4. However, sometimes I've found the problem even
>> before multixact wraparound occurred, and my reading of this thread suggests
>> that the current issue would not show up in that case, so I don't know how
>> much comfort to take in the knowledge that the problem seems to have gone
>> away -- as they say, things that go away by themselves without explanation
>> also tend to come back by themselves.
>
> I think this problem is rather related to the code touched by commit
> 722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
> unneeded files in pg_multixact/members. I checked your files, and it
> seems that pg_multixact/members files that should still exist are gone;
> for instance, pg_controldata says that the oldest multi is 2586314407,
> but if I run
>   select * from pg_get_multixact_members('that value')
> I get an error saying that pg_multixact/members/218B does not exist.
>
> It seems that there are a lot of files that were improperly removed.
> I expect that existing files should be sequential, but there are a lot
> of holes in a listing of pg_multixact/members.

I think the holes are because a truncation was in progress and was interrupted by a 'pg_ctl stop -mi' before it completed. The files are considered in an apparently random order, so if the truncation is interrupted midway through, it will leave a collection of holes that would have been closed up had it been allowed to continue.

I instrumented the code to log file deletion (attached patch), and I see that it starts complaining about the absence of 5183 just a fraction of a second after that file was deleted, so the deletion must have been inappropriate.

12538 2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file "pg_multixact/members/5183": No such file or directory.
12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file "pg_multixact/members/5183": No such file or directory.
12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file "pg_multixact/members/5183": No such file or directory.

This problem was initially fairly easy to reproduce, but since I started adding instrumentation specifically to catch it, it has become devilishly hard to reproduce.

I think my next step will be to also log each of the values that goes into the complex if (...) expression that decides on the deletion.

Cheers,

Jeff
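A minimal standalone sketch of that kind of instrumentation is below. It is illustrative only: the helper name, the fprintf-based logging, and the segment numbering shown here are assumptions, and the actual patch presumably adds an elog() call to the backend's SLRU file-deletion path rather than anything like this.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Log the segment path just before unlinking it, so a deletion can later be
 * correlated with "Could not open file" errors in the server log. */
static void
delete_members_segment_logged(unsigned int segno)
{
    char path[64];

    /* members segments are named in upper-case hex, e.g. "5183" */
    snprintf(path, sizeof(path), "pg_multixact/members/%04X", segno);
    fprintf(stderr, "LOG: deleting %s\n", path);
    if (unlink(path) != 0 && errno != ENOENT)
        fprintf(stderr, "LOG: could not unlink \"%s\": %s\n",
                path, strerror(errno));
}

int
main(void)
{
    delete_members_segment_logged(0x5183);
    return 0;
}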
Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
From: Alvaro Herrera
Jeff Janes wrote:

> 12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
> 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
> "pg_multixact/members/5183": No such file or directory.
>
> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.

I think I see the problem here now, after letting this test rig run for a while.

First, the fact that there are holes in members/ files because of the random order of deletion, in itself, seems harmless, because the files remaining in between will be deleted by a future vacuum.

Now, the real problem is that we delete files during vacuum, but the state that marks those files as safely deletable is written as part of a checkpoint record, not by vacuum itself (vacuum writes its state in pg_database, but a checkpoint derives its info from a shared memory variable). Taken together, this means that if there's a crash between the vacuum that does a deletion and the next checkpoint, we might attempt to read an offset file that is not supposed to be part of the live range -- but we forgot that because we didn't reach the point where we save the shmem state to disk.

It seems to me that we need to keep the offsets files around until a checkpoint has written the "oldest" number to WAL. In other words, we need additional state in shared memory: (a) what we currently store, which is the oldest number as computed by vacuum (not safe to delete, but it's the number that the next checkpoint must write), and (b) the oldest number that the last checkpoint wrote (the safe deletion point).

Another thing I noticed is that more than one vacuum process can try to run deletion simultaneously, at least if they crash frequently while they are doing deletion. I don't see that this is troublesome, even though they might attempt to delete the same files.

Finally, I noticed that we first read the oldest offset file, then determine the member files to delete; then delete offset files, then delete member files. Which means that we would delete offset files that correspond to member files that we keep (assuming there is a crash in the middle of deletion). It seems to me we should first delete members, then delete offsets, if we wanted to be very safe about it. I don't think this really would matter much if we were to do things safely as described above.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
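A standalone sketch of the two-value bookkeeping proposed above (the struct and field names here are illustrative rather than the actual MultiXactState contents, and multixact wraparound arithmetic is ignored for brevity): vacuum advances its own notion of the oldest multixact, but file deletion trusts only the value that the last checkpoint has made durable.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;

typedef struct
{
    MultiXactId oldestMultiXactId;      /* (a) computed by vacuum; the next checkpoint must write it */
    MultiXactId lastCheckpointedOldest; /* (b) written by the last checkpoint; the safe deletion point */
} MultiXactSharedStateSketch;

/* A segment may be removed only if it lies entirely before the checkpointed
 * boundary, not merely before vacuum's newer value. */
static bool
segment_is_removable(const MultiXactSharedStateSketch *state,
                     MultiXactId segment_upper_bound)
{
    return segment_upper_bound < state->lastCheckpointedOldest;
}

int
main(void)
{
    MultiXactSharedStateSketch state = { 2586314407u, 2500000000u };

    /* Vacuum has moved (a) past this segment, but no checkpoint has recorded
     * that yet, so the segment must be kept: prints 0. */
    printf("%d\n", segment_is_removable(&state, 2550000000u));
    return 0;
}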
Re: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
From: Andres Freund
On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:
> Jeff Janes wrote:
>
> > 12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
> > 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> >
> > This problem was initially fairly easy to reproduce, but since I
> > started adding instrumentation specifically to catch it, it has become
> > devilishly hard to reproduce.
>
> I think I see the problem here now, after letting this test rig run for
> a while.
>
> First, the fact that there are holes in members/ files because of the
> random order of deletion, in itself, seems harmless, because the files
> remaining in between will be deleted by a future vacuum.
>
> Now, the real problem is that we delete files during vacuum, but the
> state that marks those files as safely deletable is written as part of a
> checkpoint record, not by vacuum itself (vacuum writes its state in
> pg_database, but a checkpoint derives its info from a shared memory
> variable). Taken together, this means that if there's a crash between
> the vacuum that does a deletion and the next checkpoint, we might
> attempt to read an offset file that is not supposed to be part of the
> live range -- but we forgot that because we didn't reach the point where
> we save the shmem state to disk.

> It seems to me that we need to keep the offsets files around until a
> checkpoint has written the "oldest" number to WAL. In other words, we
> need additional state in shared memory: (a) what we currently store,
> which is the oldest number as computed by vacuum (not safe to delete,
> but it's the number that the next checkpoint must write), and (b) the
> oldest number that the last checkpoint wrote (the safe deletion point).

Why not just WAL-log truncations? If we emit the WAL record after determining the offsets page, we should be safe, I think. That seems like an easier and more robust fix, and it's what e.g. the clog does.

> Another thing I noticed is that more than one vacuum process can try to
> run deletion simultaneously, at least if they crash frequently while
> they are doing deletion. I don't see that this is troublesome, even
> though they might attempt to delete the same files.

That actually seems bad to me. It might fail to fail, but still.

> Finally, I noticed that we first read the oldest offset file, then
> determine the member files to delete; then delete offset files, then
> delete member files. Which means that we would delete offset files that
> correspond to member files that we keep (assuming there is a crash in
> the middle of deletion). It seems to me we should first delete members,
> then delete offsets, if we wanted to be very safe about it. I don't
> think this really would matter much if we were to do things safely as
> described above.

Part of that seems solvable by WAL-logging truncations. But I also think we need more robust tracking of the oldest member offset -- we still aren't safe against member wraparounds, and I don't see how we can be without explicitly tracking the oldest member instead of the ugly "search for the oldest offset segment and map that to members" approach we use now.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
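A standalone sketch of the ordering being suggested here (the record layout and the file-based stand-in for WAL are assumptions; the real mechanism would go through XLogInsert and a redo routine, as pg_clog's truncation record does): make the truncation boundary durable first, and only then unlink segment files, so that replay after a crash sees the same boundary.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int oldest_offsets_seg;  /* first pg_multixact/offsets segment to keep */
    unsigned int oldest_members_seg;  /* first pg_multixact/members segment to keep */
} TruncateRecordSketch;

/* Append the truncation boundaries to a stand-in log and flush it before any
 * files are removed (a real implementation would also fsync, and would replay
 * the record during recovery). */
static int
log_truncation(FILE *wal, const TruncateRecordSketch *rec)
{
    if (fwrite(rec, sizeof(*rec), 1, wal) != 1)
        return -1;
    return fflush(wal);
}

int
main(void)
{
    TruncateRecordSketch rec = { 0x218B, 0x5183 };
    FILE *wal = fopen("fake_wal.bin", "ab");

    if (wal == NULL)
        return EXIT_FAILURE;
    if (log_truncation(wal, &rec) != 0)    /* step 1: record the decision durably */
        return EXIT_FAILURE;
    /* step 2: only now unlink segments older than the recorded boundaries */
    fclose(wal);
    return EXIT_SUCCESS;
}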
Re: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
From: Alvaro Herrera
Andres Freund wrote:
> On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:
> > It seems to me that we need to keep the offsets files around until a
> > checkpoint has written the "oldest" number to WAL. In other words, we
> > need additional state in shared memory: (a) what we currently store,
> > which is the oldest number as computed by vacuum (not safe to delete,
> > but it's the number that the next checkpoint must write), and (b) the
> > oldest number that the last checkpoint wrote (the safe deletion point).
>
> Why not just WAL-log truncations? If we emit the WAL record after
> determining the offsets page, we should be safe, I think. That seems
> like an easier and more robust fix, and it's what e.g. the clog does.

Yes, I think this whole thing would be simpler if we just WAL-logged the truncations, like pg_clog does. But I would like to avoid doing that for now and do it only in 9.5 going forward. As a backpatchable (to 9.4/9.3) fix, I propose we do the following:

1. Have vacuum update MultiXactState->oldestMultiXactId based on the minimum value of pg_database->datminmxid. Since this value is saved in pg_control, it is restored from checkpoint replay during recovery.

2. Keep track of a new value, MultiXactState->lastCheckpointedOldest. This value is updated by CreateCheckPoint in a primary server after the checkpoint record has been flushed, and by xlog_redo in a hot standby, to be the MultiXactState->oldestMultiXactId value that was last flushed.

3. TruncateMultiXact() no longer receives a parameter. Files are removed based on MultiXactState->lastCheckpointedOldest instead.

4. Call TruncateMultiXact at checkpoint time, after the checkpoint WAL record has been flushed, and at restartpoint time (just like today). This means we only remove files that a prior checkpoint has already registered as being no longer necessary. Also, if recovery is interrupted before end of WAL (recovery target), the files are still present. So we no longer truncate during vacuum.

Another consideration for (4) is that right now we only invoke multixact truncation in a primary when we're able to advance pg_database.datminmxid (see vac_update_datfrozenxid). The problem is that after a crash and subsequent recovery, pg_database might be updated without removing pg_multixact files; this would mean that the next opportunity to remove files would be far in the future, when the minimum datminmxid is advanced again. One way to fix that would be to have every single call to vac_update_datfrozenxid() attempt multixact truncation, but that seems wasteful, since I expect vacuuming is more frequent than checkpointing.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
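A standalone sketch of how steps 2 through 4 fit together (only the names oldestMultiXactId, lastCheckpointedOldest, TruncateMultiXact, and CreateCheckPoint come from the proposal above; everything else, including the placeholder bodies, is illustrative): the checkpoint path publishes the just-flushed oldest multixact and only then truncates, so files are never removed ahead of a checkpoint record that justifies their removal.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;

static struct
{
    MultiXactId oldestMultiXactId;      /* step 1: advanced by vacuum from min(datminmxid) */
    MultiXactId lastCheckpointedOldest; /* step 2: advanced after the checkpoint record is flushed */
} MultiXactStateSketch = { 2586314407u, 2500000000u };

/* Step 3: truncation takes no parameter and trusts only the checkpointed value. */
static void
TruncateMultiXactSketch(void)
{
    printf("removing pg_multixact files older than %u\n",
           (unsigned) MultiXactStateSketch.lastCheckpointedOldest);
}

/* Step 4: called from the checkpoint (or restartpoint) path, after the
 * checkpoint record carrying oldestMultiXactId has been flushed. */
static void
CreateCheckPointSketch(void)
{
    /* ... write and flush the checkpoint record containing oldestMultiXactId ... */
    MultiXactStateSketch.lastCheckpointedOldest = MultiXactStateSketch.oldestMultiXactId;
    TruncateMultiXactSketch();
}

int
main(void)
{
    CreateCheckPointSketch();
    return 0;
}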
Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
From: Alvaro Herrera
Jeff Janes wrote:

> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.
>
> I think my next step will be to also log each of the values that goes
> into the complex if (...) expression that decides on the deletion.

Could you please try to reproduce it after updating to latest? I pushed fixes that should close these issues. Maybe you want to remove the instrumentation you added, to make failures more likely.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services