Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE |
Date | |
Msg-id | CAM3SWZRyrS02qQq9mjvmzLVoiaCdoXM2WOQs=K31i=YV8GO9TA@mail.gmail.com |
In response to | Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE |
List | pgsql-hackers |
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Nice to see the next version, won't have time to look in any details in
> the next few days tho.

Thanks, Andres!

> I think for this approach to be workable you also need to explain how we
> can deal with stuff like toast insertion that may need to write hundreds
> of megabytes all the while leaving an entire value-range of the unique
> key share locked.

Right. That is a question that needs to be addressed in a future
revision.

> I still think that even doing a plain heap insertion is longer than
> acceptable to hold even a share lock over a btree page

Well, there is really only one way of judging something like that, and
that's to do a benchmark. I still haven't taken the time to "pick the
low-hanging fruit" here that I'd mentioned - there are some fairly
obvious ways to shorten the window in which value locks are held.
Furthermore, I'm sort of at a loss as to what a fair benchmark would
even look like - what is actually representative here? Also, what's
the baseline? It's not as if someone has an alternative, competing
patch. We can only hypothesize about the additional costs the other
approaches would introduce, unless someone has a suggestion as to how
they could be simulated without writing the full patch - something I'd
certainly entertain.

As I've already pointed out, all page splits occur with the same
buffer exclusive lock held. Only, in our case, we're weakening that
lock to a shared lock. So I don't think that the heap insertion is
going to be that big of a deal, particularly in the average case.
Having said that, it's a question that must be closely examined before
proceeding much further. And yes, the worst case could be pretty bad,
and that surely matters too.

> The easiest answer is doing the toasting before doing the index locking,
> but that will result in bloat, the avoidance of which seems to be the
> primary advantage of your approach.

I would say that the primary advantage of my approach is that it's
much simpler than any other approach that has been considered in the
past. It's easier to reason about because it's really just an
extension of how btrees already do value locking. Granted, I haven't
adequately demonstrated that things really are so rosy, but I think
I'll be able to. The key point is that, with trivial exceptions, no
other part of the code - VACUUM included - considers itself to have
license to acquire locks on btree buffers directly; everything else
goes through the AM interface instead. What would those other parts
know about what makes sense for a particular AM? The surface area
actually turns out to be fairly manageable.

With the promise tuple approach, it's more the maintainability
overhead of new *classes* of bloat that concerns me than the bloat
itself, along with all the edge cases that are likely to be
introduced. But yes, the overhead of doing all that extra writing
(including WAL-logging twice), and the fact that it all has to happen
with an exclusive lock held on the leaf page buffer, is also a concern
of mine. With v3 of my patch, we still only have to do all the
preliminary work - finding the right page and verifying that there are
no duplicates - once. So with recent revisions, the amount of time
spent holding the exclusive lock under my proposed approach is now
approximately half that of the alternative proposals (assuming no page
split is necessary).
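To make the ordering concrete, here is a rough sketch of the flow I
have in mind. To be clear, this is illustrative pseudocode and not the
patch itself: the helper functions are invented for this mail, and the
weaken/strengthen steps are shown as a naive unlock-and-relock, since
core's LockBuffer() has no atomic downgrade or upgrade (a real
implementation must not give up the value lock in between):

/*
 * Illustrative sketch only, not the patch itself: the helper functions
 * marked "hypothetical" are invented for this mail, TID plumbing
 * between heap tuple and index tuple is omitted, and a real
 * implementation needs weaken/strengthen primitives rather than the
 * unlock-and-relock shown here (which would let a concurrent inserter
 * slip in a duplicate in between).
 */
#include "postgres.h"

#include "access/htup.h"
#include "access/itup.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

extern Buffer _bt_locate_leaf_exclusive(Relation idx, IndexTuple itup); /* hypothetical */
extern bool _bt_leaf_has_duplicate(Buffer buf, IndexTuple itup);        /* hypothetical */
extern void heap_insert_wrapper(Relation heap, HeapTuple htup);         /* hypothetical */
extern void _bt_do_insert(Relation idx, Buffer buf, IndexTuple itup);   /* hypothetical */

static bool
speculative_insert(Relation heap, Relation idx, HeapTuple htup,
				   IndexTuple itup)
{
	/* Find the leaf page and check for duplicates once, under exclusive lock */
	Buffer		buf = _bt_locate_leaf_exclusive(idx, itup);

	if (_bt_leaf_has_duplicate(buf, itup))
	{
		UnlockReleaseBuffer(buf);
		return false;			/* caller may lock the existing row instead */
	}

	/* Weaken to a share lock: the "value lock" held across heap insertion */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	LockBuffer(buf, BUFFER_LOCK_SHARE);

	heap_insert_wrapper(heap, htup);	/* possibly long: TOAST and so on */

	/* Strengthen back to exclusive for the actual index tuple insertion */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
	_bt_do_insert(idx, buf, itup);	/* a page split would hold this same lock */

	UnlockReleaseBuffer(buf);
	return true;
}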
In the worst case, the set of values locked on the leaf page is quite
localized and manageable, as a natural consequence of the fact that
it's a btree leaf page. I haven't run any numbers, but for an int4
btree (which really is the worst case here), 200 or so read-locked
values would be quite close to as bad as things get (see the
back-of-the-envelope arithmetic at the end of this mail). Plus, if
there isn't a second phase of locking - and on average there's a
strong chance of that - those locks would hardly be held at all.
Contrast that with having to do lots of exclusive locking for all that
clean-up.

I might experiment with weakening the exclusive lock even earlier in
my next revision, and/or strengthening it later. Offhand, I can't see
a reason not to weaken as soon as we find the first leaf page that the
key might be on (granted, I haven't thought about it that much), since
_bt_check_unique() does not have license to alter the buffer already
proposed for insertion. Come to think of it, all of this new buffer
lock weakening/strengthening machinery might independently justify
itself as an optimization to regular btree index tuple insertion.
That's a whole other patch, though -- it's a big ambition to have as a
sort of incidental adjunct to what is already a big, complex patch.

In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus, bloat-wise, it can't
be any worse than simply inserting the tuple and having the
transaction abort on a duplicate, since with regular insertion that
already happens after toasting has done its work.
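As a quick back-of-the-envelope check on that "200 or so" figure
(standalone arithmetic, not anything from the patch; it just assumes
default 8KB pages and approximate header sizes):

#include <stdio.h>

int
main(void)
{
	/*
	 * Each int4 leaf entry: a 4-byte line pointer, plus an index tuple
	 * of 8-byte IndexTupleData header + 4 bytes of key, MAXALIGNed to
	 * 16 bytes, for 20 bytes in total.
	 */
	const int	page_size = 8192;	/* default BLCKSZ */
	const int	page_header = 24;	/* PageHeaderData, approx. */
	const int	btree_special = 16; /* BTPageOpaqueData, approx. */
	const int	per_entry = 4 + 16; /* line pointer + MAXALIGNed tuple */

	int			max_entries = (page_size - page_header - btree_special)
	/ per_entry;

	printf("packed int4 entries per leaf page: ~%d\n", max_entries);
	printf("at ~50%% fill (right after a split): ~%d\n", max_entries / 2);
	return 0;
}

That's roughly 400 entries for a completely packed page and about 200
for one that's half full, so a couple of hundred share-locked values
really does look like the right ballpark for the worst case.

-- 
Peter Geoghegan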