Thread: Small Bug in GetConflictingVirtualXIDs

Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
Hi Simon, Hi all,


HS currently does the following in GetConflictingVirtualXIDs


TransactionId pxmin = proc->xmin;

/** We ignore an invalid pxmin because this means that backend* has no snapshot and cannot get another one while we
holdexclusive lock.*/
 
if (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin))
{VirtualTransactionId vxid;
GET_VXID_FROM_PGPROC(vxid, *proc);if (VirtualTransactionIdIsValid(vxid))    vxids[count++] = vxid;
}


The logic behind this seems fine except in the case of dropping a database. 
There you very well might have a open connection without an open snapshot.

This leads to nice errors when youre connected to a database on the slave 
without having a in progress transaction while dropping the database on the 
primary:


CET FATAL:  terminating connection due to administrator command
CET LOG:  shutting down
CET LOG:  restartpoint starting: shutdown immediate
CET FATAL:  could not open file "base/16604/1259": No such file or directory
CET CONTEXT:  writing block 5 of relation base/16604/1259
CET WARNING:  could not write block 5 of base/16604/1259
CET DETAIL:  Multiple failures --- write error might be permanent.
CET CONTEXT:  writing block 5 of relation base/16604/1259

Should easily be solvable through an extra parameter for 
GetConflictingVirtualXIDs. Want me to prepare a patch?

Andres


Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Mon, 2009-12-21 at 04:02 +0100, Andres Freund wrote:

> The logic behind this seems fine except in the case of dropping a database. 
> There you very well might have a open connection without an open snapshot.

Yes, you're right, thanks for the report. 

I re-arranged the logic there recently to reduce the number of backends
that would conflict, so it looks like I broke that case when I did
that. 

> Should easily be solvable through an extra parameter for 
> GetConflictingVirtualXIDs. Want me to prepare a patch?

Much appreciated, thanks.

-- Simon Riggs           www.2ndQuadrant.com



Re: Small Bug in GetConflictingVirtualXIDs

From
Alvaro Herrera
Date:
Andres Freund wrote:

> The logic behind this seems fine except in the case of dropping a database. 
> There you very well might have a open connection without an open snapshot.

Perhaps the simplest fix is to ensure that drop database gets a snapshot?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Small Bug in GetConflictingVirtualXIDs

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Andres Freund wrote:
>> The logic behind this seems fine except in the case of dropping a database. 
>> There you very well might have a open connection without an open snapshot.

> Perhaps the simplest fix is to ensure that drop database gets a snapshot?

I confess to not having followed the thread closely, but why is DROP
DATABASE special in this regard?  Wouldn't we soon find ourselves
needing every utility command to take a snapshot?
        regards, tom lane


Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Mon, 2009-12-21 at 10:38 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Andres Freund wrote:
> >> The logic behind this seems fine except in the case of dropping a database. 
> >> There you very well might have a open connection without an open snapshot.
> 
> > Perhaps the simplest fix is to ensure that drop database gets a snapshot?
> 
> I confess to not having followed the thread closely, but why is DROP
> DATABASE special in this regard?  Wouldn't we soon find ourselves
> needing every utility command to take a snapshot?

Andres has worded this a little imprecisely, causing a confusion. In
cases regarding HS we need to be clear whether the interacting sessions
are on the master or on the standby to understand the reasons for poor
interactions.

What he means is that you can be connected to the standby without an
open snapshot (from the standby) at the point we replay a drop database
command that had been run on the master. That case currently causes the
bug, created by my recent change to GetConflictingVirtualXids().

Giving the drop database a snapshot is not the answer. I expect Andres
to be able to fix this with a simple patch that would not effect the
case of normal running.

-- Simon Riggs           www.2ndQuadrant.com



Re: Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
On Monday 21 December 2009 16:38:07 Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Andres Freund wrote:
> >> The logic behind this seems fine except in the case of dropping a
> >> database. There you very well might have a open connection without an
> >> open snapshot.
> > Perhaps the simplest fix is to ensure that drop database gets a snapshot?
> I confess to not having followed the thread closely, but why is DROP
> DATABASE special in this regard?  Wouldn't we soon find ourselves
> needing every utility command to take a snapshot?
Because most other "entities" are locked when you access them. So on the 
standby the AccessExlusive (generated on the master) will conflict with 
whatever lock you currently have on that entity (on the slave).
There are no locks for an idle session though.

Andres


Re: Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> Giving the drop database a snapshot is not the answer. I expect Andres
> to be able to fix this with a simple patch that would not effect the
> case of normal running.
Actually its less simply than I had thought at first - I don't think the code 
ever handled that correctly.
I might be wrong there, my knowledge of the involved code is a bit sparse...
The whole conflict resolution builds on the concept of waiting for an VXid, but 
an idle backend does not have a valid vxid. Thats correct, right?
Sure, the code should be modifyable to handle that code mostly transparently 
(simply ignoring a invalid localTransactionId when the database id is valid), 
but ...

I am inclined to just unconditionally kill the users of the database. Its not 
like that would be an issue in production. I cant see a case where its 
important to run a session to its end on a database which was dropped on the 
master.
Opinions on that?

Andres


Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > Giving the drop database a snapshot is not the answer. I expect Andres
> > to be able to fix this with a simple patch that would not effect the
> > case of normal running.
> Actually its less simply than I had thought at first - I don't think the code 
> ever handled that correctly.
> I might be wrong there, my knowledge of the involved code is a bit sparse...
> The whole conflict resolution builds on the concept of waiting for an VXid, but 
> an idle backend does not have a valid vxid. Thats correct, right?

Yes, that's correct. I'll take this one back then.

> Sure, the code should be modifyable to handle that code mostly transparently 
> (simply ignoring a invalid localTransactionId when the database id is valid), 
> but ...
> 
> I am inclined to just unconditionally kill the users of the database. Its not 
> like that would be an issue in production. I cant see a case where its 
> important to run a session to its end on a database which was dropped on the 
> master.
> Opinions on that?

I don't see any mileage in making Startup process wait for an idle
session, so no real reason to wait for others either.

-- Simon Riggs           www.2ndQuadrant.com



Re: Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > Giving the drop database a snapshot is not the answer. I expect Andres
> > > to be able to fix this with a simple patch that would not effect the
> > > case of normal running.
> >
> > Actually its less simply than I had thought at first - I don't think the
> > code ever handled that correctly.
> > I might be wrong there, my knowledge of the involved code is a bit
> > sparse... The whole conflict resolution builds on the concept of waiting
> > for an VXid, but an idle backend does not have a valid vxid. Thats
> > correct, right?
> 
> Yes, that's correct. I'll take this one back then.
So youre writing a fix or shall I?

Andres


Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Sun, 2009-12-27 at 20:11 +0100, Andres Freund wrote:
> On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> > On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > > Giving the drop database a snapshot is not the answer. I expect Andres
> > > > to be able to fix this with a simple patch that would not effect the
> > > > case of normal running.
> > >
> > > Actually its less simply than I had thought at first - I don't think the
> > > code ever handled that correctly.
> > > I might be wrong there, my knowledge of the involved code is a bit
> > > sparse... The whole conflict resolution builds on the concept of waiting
> > > for an VXid, but an idle backend does not have a valid vxid. Thats
> > > correct, right?
> > I don't see any mileage in making Startup process wait for an idle
> > session, so no real reason to wait for others either.
> So here is a small patch implementing that behaviour.

OK, thanks. I have also written something, as mentioned. Will review.

-- Simon Riggs           www.2ndQuadrant.com



Re: Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > Giving the drop database a snapshot is not the answer. I expect Andres
> > > to be able to fix this with a simple patch that would not effect the
> > > case of normal running.
> >
> > Actually its less simply than I had thought at first - I don't think the
> > code ever handled that correctly.
> > I might be wrong there, my knowledge of the involved code is a bit
> > sparse... The whole conflict resolution builds on the concept of waiting
> > for an VXid, but an idle backend does not have a valid vxid. Thats
> > correct, right?
> I don't see any mileage in making Startup process wait for an idle
> session, so no real reason to wait for others either.
So here is a small patch implementing that behaviour.

Andres

Re: Small Bug in GetConflictingVirtualXIDs

From
Andres Freund
Date:
On Sunday 27 December 2009 21:04:43 Simon Riggs wrote:
> On Sun, 2009-12-27 at 20:11 +0100, Andres Freund wrote:
> > On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> > > On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > > > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > > > Giving the drop database a snapshot is not the answer. I expect
> > > > > Andres to be able to fix this with a simple patch that would not
> > > > > effect the case of normal running.
> > > >
> > > > Actually its less simply than I had thought at first - I don't think
> > > > the code ever handled that correctly.
> > > > I might be wrong there, my knowledge of the involved code is a bit
> > > > sparse... The whole conflict resolution builds on the concept of
> > > > waiting for an VXid, but an idle backend does not have a valid vxid.
> > > > Thats correct, right?
> > >
> > > I don't see any mileage in making Startup process wait for an idle
> > > session, so no real reason to wait for others either.
> >
> > So here is a small patch implementing that behaviour.
> 
> OK, thanks. I have also written something, as mentioned. Will review.
Thats why I had asked in another mail ;-) But I have learned a bit more about 
pg while writing that patch so its fine for me at least ;-)

Andres


Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Sun, 2009-12-27 at 20:11 +0100, Andres Freund wrote:
> On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> > On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > > Giving the drop database a snapshot is not the answer. I expect Andres
> > > > to be able to fix this with a simple patch that would not effect the
> > > > case of normal running.
> > >
> > > Actually its less simply than I had thought at first - I don't think the
> > > code ever handled that correctly.
> > > I might be wrong there, my knowledge of the involved code is a bit
> > > sparse... The whole conflict resolution builds on the concept of waiting
> > > for an VXid, but an idle backend does not have a valid vxid. Thats
> > > correct, right?
> > I don't see any mileage in making Startup process wait for an idle
> > session, so no real reason to wait for others either.
> So here is a small patch implementing that behaviour.

I've committed a slightly modified form of this patch.

It was an outstanding bug, so delaying fix at this stage not worth it. I
had in mind a slightly grander fix, but it's hardly a priority to polish
the chrome on this one.

Thanks for the bug report and fix.

-- Simon Riggs           www.2ndQuadrant.com



Re: Small Bug in GetConflictingVirtualXIDs

From
Simon Riggs
Date:
On Sun, 2009-12-27 at 20:11 +0100, Andres Freund wrote:
> On Tuesday 22 December 2009 11:42:30 Simon Riggs wrote:
> > On Tue, 2009-12-22 at 03:19 +0100, Andres Freund wrote:
> > > On Monday 21 December 2009 16:48:52 Simon Riggs wrote:
> > > > Giving the drop database a snapshot is not the answer. I expect Andres
> > > > to be able to fix this with a simple patch that would not effect the
> > > > case of normal running.
> > >
> > > Actually its less simply than I had thought at first - I don't think the
> > > code ever handled that correctly.
> > > I might be wrong there, my knowledge of the involved code is a bit
> > > sparse... The whole conflict resolution builds on the concept of waiting
> > > for an VXid, but an idle backend does not have a valid vxid. Thats
> > > correct, right?
> > I don't see any mileage in making Startup process wait for an idle
> > session, so no real reason to wait for others either.
> So here is a small patch implementing that behaviour.

On further testing, I received a re-connection from an automatic session
retry. That shouldn't have happened, but it indicates the need for
locking around the conflict handler. I had understood that to be placed
elsewhere but that seems wrong now.

This is a low priority item, so looking for a quick fix to allow time on
other areas.

Any objections?

--
 Simon Riggs           www.2ndQuadrant.com

Attachment