Thread: Potential GIN vacuum bug
On 08/16/2015 12:58 AM, Jeff Janes wrote:
> When ginbulkdelete gets called for the first time in a VACUUM (i.e. stats
> == NULL), one of the first things it does is call ginInsertCleanup to get
> rid of the pending list. It does this in lieu of vacuuming the pending
> list.
>
> This is important because if there are any dead tids still in the pending
> list, someone else could come along during the vacuum and post the dead
> tids into a part of the index that VACUUM has already passed over.
>
> The potential bug is that ginInsertCleanup exits early (ginfast.c lines
> 796, 860, 898) if it detects that someone else is cleaning up the pending
> list, without waiting for that someone else to finish the job.
>
> Isn't this a problem?

Yep, I think you're right. When that code runs as part of VACUUM, it should not give up like that.

Hmm, I see other race conditions in that code too. Even if VACUUM wins the race you spotted, and performs all the insertions, reaches the end of the pending items list, and deletes the pending list pages, it's possible that another backend started earlier, and is still processing the same items from the pending items list. It will add them to the tree, and after it's finished with that it will see that the pending list page was already deleted, and bail out. But if there is a dead tuple in the pending items list, you have trouble. The other backend will re-insert it, and that might happen after VACUUM had already removed it from the tree.

Also, ginInsertCleanup() seems to assume that if another backend has just finished cleaning up the same page, it will see the page marked as deleted. But what if the page is not only marked as deleted, but also reused for something else already?

- Heikki
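The dead-tuple re-insertion race described above can be seen in a toy, single-threaded trace. The data structures here are illustrative stand-ins, not the server's: `pending` models the pending list, `tree` the main entry tree, and TIDs are plain strings.

```python
# Toy trace of the race: a backend snapshots the pending list, VACUUM
# completes the whole cleanup (including removing dead TIDs from the
# tree), and the slower backend then re-inserts from its stale snapshot.
pending = [("tid1", "dead"), ("tid2", "live")]
tree = set()

# A backend started a cleanup earlier and is working from this snapshot.
backend_snapshot = list(pending)

# VACUUM wins the race: it inserts every pending item into the tree,
# deletes the pending list pages, then removes the dead TID from the tree.
tree.update(tid for tid, state in pending)
pending.clear()
tree.discard("tid1")

# The slower backend finishes its own pass over the stale snapshot,
# resurrecting the dead TID after VACUUM already removed it.
tree.update(tid for tid, state in backend_snapshot)

assert "tid1" in tree  # the dead TID is back in the index
```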
On Aug 16, 2015 11:49 PM, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
>
> On 08/16/2015 12:58 AM, Jeff Janes wrote:
>>
>> When ginbulkdelete gets called for the first time in a VACUUM (i.e. stats
>> == NULL), one of the first things it does is call ginInsertCleanup to get
>> rid of the pending list. It does this in lieu of vacuuming the pending
>> list.
>>
>> This is important because if there are any dead tids still in the Pending
>> list, someone else could come along during the vacuum and post the dead
>> tids into a part of the index that VACUUM has already passed over.
>>
>> The potential bug is that ginInsertCleanup exits early (ginfast.c lines
>> 796, 860, 898) if it detects that someone else is cleaning up the pending
>> list, without waiting for that someone else to finish the job.
>>
>> Isn't this a problem?
>
>
> Yep, I think you're right. When that code runs as part of VACUUM, it should not give up like that.
>
> Hmm, I see other race conditions in that code too. Even if VACUUM wins the race you spotted, and performs all the insertions, reaches the end of the pending items list, and deletes the pending list pages, it's possible that another backend started earlier, and is still processing the same items from the pending items list. It will add them to the tree, and after it's finished with that it will see that the pending list page was already deleted, and bail out. But if there is a dead tuple in the pending items list, you have trouble. The other backend will re-insert it, and that might happen after VACUUM had already removed it from the tree.

Could the right to clean the pending list be represented by a self-conflicting heavyweight lock on the index? Vacuum could block on it, while user backends could try to get it conditionally and just give up on the cleanup if it is not available.
>
> Also, ginInsertCleanup() seems to assume that if another backend has just finished cleaning up the same page, it will see the page marked as deleted. But what if the page is not only marked as deleted, but also reused for something else already?

Yeah. That is possible but pretty unlikely now; it would be far more likely if we added these pages to the fsm more aggressively.
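The self-conflicting-lock proposal above can be sketched outside the server. Here a plain mutex stands in for the heavyweight lock on the index; none of these names are PostgreSQL APIs.

```python
import threading

# Mutex standing in for a self-conflicting heavyweight lock on the index.
pending_list_lock = threading.Lock()

def vacuum_cleanup(clean_pending_list):
    # VACUUM must not skip the cleanup, so it blocks until any
    # concurrent cleaner has finished.
    with pending_list_lock:
        clean_pending_list()

def backend_cleanup(clean_pending_list):
    # A user backend only tries the lock conditionally; if someone else
    # holds it, the backend gives up on the cleanup and simply keeps
    # appending to the pending list.
    if not pending_list_lock.acquire(blocking=False):
        return False
    try:
        clean_pending_list()
        return True
    finally:
        pending_list_lock.release()
```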
Jeff Janes wrote:
> The attached patch takes a ShareUpdateExclusiveLock lock on the index in
> order to clean the pending list.

Does it take a lock on the table also? Because if not ...

> One potential problem is how it will interact with "create index
> concurrently".

... then I don't understand how you could have a problem here. Surely no pending list cleanup can happen concurrently with the index being created?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 17, 2015 at 5:41 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> User backends attempt to take the lock conditionally, because otherwise they
> would cause an autovacuum already holding the lock to cancel itself, which
> seems quite bad.
>
> Not that this is a substantial behavior change, in that with this code the
> user backends which find the list already being cleaned will just add to the
> end of the pending list and go about their business. So if things are added
> to the list faster than they can be cleaned up, the size of the pending list
> can increase without bound.
>
> Under the existing code each concurrent user backend will try to clean the
> pending list at the same time. The work doesn't parallelize, so doing this
> just burns CPU (and possibly consumes up to maintenance_work_mem for
> *each* backend), but it does serve to throttle the insertion rate and so
> keep the list from growing without bound.
>
> This is just a proof-of-concept patch, because I don't know if this approach
> is the right approach.

I'm not sure if this is the right approach, but I'm a little wary of involving the heavyweight lock manager in this. If pending list cleanups are frequent, this could involve a lot of additional lock manager traffic, which could be bad for performance.

Even if they are infrequent, it seems like it would be more natural to handle this with some regime of locks and pins and buffer cleanup locks on the buffers that are storing the pending list, rather than a heavyweight lock on the whole relation. But I am just waving my hands wildly here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 18, 2015 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 17, 2015 at 5:41 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> User backends attempt to take the lock conditionally, because otherwise they
>> would cause an autovacuum already holding the lock to cancel itself, which
>> seems quite bad.
>>
>> Not that this is a substantial behavior change, in that with this code the
>> user backends which find the list already being cleaned will just add to the
>> end of the pending list and go about their business. So if things are added
>> to the list faster than they can be cleaned up, the size of the pending list
>> can increase without bound.
>>
>> Under the existing code each concurrent user backend will try to clean the
>> pending list at the same time. The work doesn't parallelize, so doing this
>> just burns CPU (and possibly consumes up to maintenance_work_mem for
>> *each* backend), but it does serve to throttle the insertion rate and so
>> keep the list from growing without bound.
>>
>> This is just a proof-of-concept patch, because I don't know if this approach
>> is the right approach.
>
> I'm not sure if this is the right approach, but I'm a little wary of
> involving the heavyweight lock manager in this. If pending list
> cleanups are frequent, this could involve a lot of additional lock
> manager traffic, which could be bad for performance.
>
> Even if they are infrequent, it seems like it would be more natural to
> handle this with some regime of locks and pins and buffer cleanup locks
> on the buffers that are storing the pending list, rather than a
> heavyweight lock on the whole relation. But I am just waving my hands
> wildly here.

I also thought of a buffer cleanup lock on the pending list head buffer to represent the right to do a cleanup, with the proviso that once you have obtained the cleanup lock, you can then drop the exclusive buffer content lock and continue to hold the conceptual lock just by maintaining the pin. I think that this would be semantically correct, but backends doing a cleanup would have to take the lock conditionally, and I think you would have too many chances for false failures, where it bails out because the other party simply holds a pin. I guess I could implement it and see how it fares in my test case.
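The pin-based scheme sketched above can be modeled with a toy buffer that tracks only a pin count. This is not PostgreSQL's bufmgr; the class and method names are illustrative.

```python
import threading

class ToyBuffer:
    """Toy model of a shared buffer: a pin count protected by a mutex.
    A conditional 'cleanup lock' succeeds only when nobody else holds a
    pin; once obtained, the holder's own pin is what represents the
    exclusive right, mirroring the idea of dropping the content lock but
    keeping the pin."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.pins = 0

    def pin(self):
        with self._mutex:
            self.pins += 1

    def unpin(self):
        with self._mutex:
            self.pins -= 1

    def try_cleanup_lock(self):
        # Fails whenever any other pin exists -- even a reader merely
        # holding a pin causes the "false failure" worried about above.
        with self._mutex:
            if self.pins == 0:
                self.pins = 1  # our pin doubles as the conceptual lock
                return True
            return False
```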
Jeff Janes <jeff.janes@gmail.com> writes:
> Attached is a patch to deal with this without the heavyweight locks.
> I realized that using the clean up lock on the meta page, rather than the
> pending head page, would be easier to implement as a pin was already held
> on the meta page throughout, and the pending head page can change during
> the cleanup if we need multiple passes.
> ...
> I still prefer the heavy-weight approach. The buffer clean up lock for
> vacuuming seems fragile to start with, and abusing it for other purposes
> doesn't improve on that.

FWIW, I would go with the heavyweight lock approach as well. The extra cycles needed for a heavyweight lock don't seem significant in this context, and you have far more control over which other operations conflict or don't conflict with the lock. Taking a buffer cleanup lock on the metapage sounds really scary from that viewpoint; it's basically going to conflict with everything else, even if the other operations only take it for short intervals, and you have no freedom to adjust that.

Your earlier point about how the current design throttles insertions to keep the pending list from growing without bound seems like a bigger deal to worry about. I think we'd like to have some substitute for that. Perhaps we could make the logic in insertion be something like

    if (pending-list-size > threshold)
    {
        if (conditional-lock-acquire(...))
        {
            do-pending-list-cleanup;
            lock-release;
        }
        else if (pending-list-size > threshold * 2)
        {
            unconditional-lock-acquire(...);
            if (pending-list-size > threshold)
                do-pending-list-cleanup;
            lock-release;
        }
    }

so that once the pending list got too big, incoming insertions would wait for it to be cleared. Whether to use a 2x safety margin or something else could be a subject for debate, of course.

			regards, tom lane
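Rendered as executable Python, the two-tier throttle above behaves like this. A plain mutex stands in for the proposed heavyweight lock, and the threshold value is an arbitrary stand-in.

```python
import threading

PENDING_THRESHOLD = 4            # arbitrary stand-in for the real threshold
cleanup_lock = threading.Lock()  # stand-in for the proposed heavyweight lock

def on_insert(pending, do_cleanup):
    # Past the threshold, try an opportunistic cleanup; past 2x the
    # threshold, wait unconditionally so the list cannot grow without bound.
    if len(pending) > PENDING_THRESHOLD:
        if cleanup_lock.acquire(blocking=False):
            try:
                do_cleanup(pending)
            finally:
                cleanup_lock.release()
        elif len(pending) > PENDING_THRESHOLD * 2:
            cleanup_lock.acquire()
            try:
                # Recheck: whoever held the lock may have cleaned already.
                if len(pending) > PENDING_THRESHOLD:
                    do_cleanup(pending)
            finally:
                cleanup_lock.release()
```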
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sun, Aug 30, 2015 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Your earlier point about how the current design throttles insertions to
>> keep the pending list from growing without bound seems like a bigger deal
>> to worry about. I think we'd like to have some substitute for that.
>> ...
> If the goal is to not change existing behavior (like for back patching) the
> margin should be 1, always wait.
The current behavior is buggy, both as to performance and correctness,
so I'm not sure how come "don't change the behavior" should be a
requirement.
> But we would still have to deal with the
> fact that unconditional acquire attempt by the backends will cause a vacuum
> to cancel itself, which is undesirable.
Good point.
> If we define a new namespace for
> this lock (like the relation extension lock has its own namespace) then
> perhaps the cancellation code could be made to not cancel on that
> condition. But that too seems like a lot of work to backpatch.
We could possibly teach the autocancel logic to distinguish this lock type
from others without using a new namespace. That seems a bit klugy, but
maybe better than adding a new namespace. (For example, there are
probably only a couple of modes in which we take page-level locks at
present. Choosing a currently unused, but self-exclusive, mode for taking
such a lock might serve.)
> Would we bother to back-patch a theoretical bug which there is no evidence
> is triggering in the field?
What's theoretical about it? You seemed pretty sure that the issue in
http://www.postgresql.org/message-id/flat/CA+bfosGVGVQhMAa=0-mUE6cOo7dBSgAYxb-XsnR5vm-S39hpNg@mail.gmail.com
was exactly this.
> If we want to improve the current behavior rather than fix a bug, then I
> think that if the list is greater than threshold*2 and the cleaning lock is
> unavailable, what it should do is proceed to insert the tuple's keys into
> the index itself, as if fastupdate = off. That would require some major
> surgery to the existing code, as by the time it invokes the clean up, it is
> too late to not insert into the pending list.
Meh. That's introducing a whole new behavioral regime, which quite aside
from correctness bugs might introduce new performance bugs of its own.
It certainly doesn't sound like a better back-patch candidate than the
other thing.
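For concreteness, the fallback quoted above (and rejected here as a back-patch candidate) amounts to something like the following. This is an illustrative model only; `tree`, the threshold, and the lock probe are stand-ins.

```python
PENDING_THRESHOLD = 4  # arbitrary stand-in for the pending-list threshold

def insert_keys(keys, pending, tree, try_cleanup_lock):
    # If the pending list is badly oversized and the cleanup lock is
    # unavailable, bypass the pending list and put the keys straight
    # into the main tree, as if fastupdate were off.
    if len(pending) > PENDING_THRESHOLD * 2 and not try_cleanup_lock():
        tree.extend(keys)
    else:
        pending.append(keys)
```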
On Sun, Aug 30, 2015 at 6:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But we would still have to deal with the
>> fact that unconditional acquire attempt by the backends will cause a vacuum
>> to cancel itself, which is undesirable.
>
> Good point.
>
>> If we define a new namespace for
>> this lock (like the relation extension lock has its own namespace) then
>> perhaps the cancellation code could be made to not cancel on that
>> condition. But that too seems like a lot of work to backpatch.
>
> We could possibly teach the autocancel logic to distinguish this lock type
> from others without using a new namespace. That seems a bit klugy, but
> maybe better than adding a new namespace. (For example, there are
> probably only a couple of modes in which we take page-level locks at
> present. Choosing a currently unused, but self-exclusive, mode for taking
> such a lock might serve.)

That seems like a pretty big kludge to me; it will work until it doesn't.

Adding a new lock type (similar to "relation extension") would address a lot of my concern with the heavyweight lock approach. It strikes me that trying to grab a lock on the index in what's basically a pretty low-level operation here could have a variety of unintended consequences. The autovacuum thing is one; but consider also the risk of new deadlock scenarios resulting from a lock upgrade. Those deadlocks would likely be, to use Peter Geoghegan's term, unprincipled. The locking regime around indexes is already pretty complicated, and I'm skeptical about the idea that we can complicate it further without any blowback.

If we use a new lock type, it's a lot easier to reason about the interactions, I think. We know all of the things that will take that lock type. And we can be reasonably confident that future code changes won't introduce any new ones, or that the current ones will be considered before making changes.

It's not great to have to back-patch such a change, but in practice the blast radius should be pretty limited. People who are looking at pg_locks might start to see a new lock type show up that they're not expecting, but a lot of people either aren't looking at that data at all, or are looking at it but not doing anything programmatic with it and therefore won't really be impacted by something new showing up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Aug 30, 2015 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But we would still have to deal with the
>> fact that unconditional acquire attempt by the backends will cause a vacuum
>> to cancel itself, which is undesirable.
>
> Good point.
>
>> If we define a new namespace for
>> this lock (like the relation extension lock has its own namespace) then
>> perhaps the cancellation code could be made to not cancel on that
>> condition. But that too seems like a lot of work to backpatch.
>
> We could possibly teach the autocancel logic to distinguish this lock type
> from others without using a new namespace. That seems a bit klugy, but
> maybe better than adding a new namespace. (For example, there are
> probably only a couple of modes in which we take page-level locks at
> present. Choosing a currently unused, but self-exclusive, mode for taking
> such a lock might serve.)

Like the attached? (The conditional variant for user backends was unceremoniously yanked out.)
On Thu, Oct 1, 2015 at 4:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Is the locking around indexes summarized someplace?

Not to my knowledge. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 31, 2015 at 12:10 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, Aug 30, 2015 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> But we would still have to deal with the
>>> fact that unconditional acquire attempt by the backends will cause a vacuum
>>> to cancel itself, which is undesirable.
>>
>> Good point.
>>
>>> If we define a new namespace for
>>> this lock (like the relation extension lock has its own namespace) then
>>> perhaps the cancellation code could be made to not cancel on that
>>> condition. But that too seems like a lot of work to backpatch.
>>
>> We could possibly teach the autocancel logic to distinguish this lock type
>> from others without using a new namespace. That seems a bit klugy, but
>> maybe better than adding a new namespace. (For example, there are
>> probably only a couple of modes in which we take page-level locks at
>> present. Choosing a currently unused, but self-exclusive, mode for taking
>> such a lock might serve.)
>
> Like the attached? (The conditional variant for user backends was
> unceremoniously yanked out.)

A problem here is that now we have the user backends waiting on vacuum to do the cleanup, but vacuum is using throttled IO and so taking its sweet time at it. Under the old code, the user backend could pass up the vacuum while it was sleeping.

Maybe we could have the vacuum detect when someone is waiting on it, and temporarily suspend throttling and just run at full speed. I don't believe there is any precedent for that, but I do think there are other places where such a technique could be useful. That is kind of a scary change to backpatch.

I am running out of ideas.