Re: INSERT...ON DUPLICATE KEY IGNORE - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Msg-id: 20130901110639.GA5695@awork2.anarazel.de
In response to: Re: INSERT...ON DUPLICATE KEY IGNORE (Peter Geoghegan <pg@heroku.com>)
Responses: Re: INSERT...ON DUPLICATE KEY IGNORE
List: pgsql-hackers
On 2013-08-31 19:38:49 -0700, Peter Geoghegan wrote:
> On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
> >> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
> >> > while getting the speculative insertion into foo_b_key?
> >>
> >> Yes. The way the patch currently handles that case is unacceptable. It
> >> needs to be more concerned about holding an exclusive lock on
> >> earlier-locked indexes when locking the second and subsequent indexes
> >> iff it blocks on another transaction in the course of locking the
> >> second and subsequent indexes.
>
> I think that the fix here may be to do something analogous to what the
> code already does: release all buffer locks, wait for the other
> transaction to commit/abort, and try again. The difference being that
> we try again across all unique indexes rather than just this one. I
> have other obligations to take care of over the next couple of days,
> so I don't really have time to code that up and satisfy myself that
> it's robust; I'd prefer to have something more concrete to back this
> thought up, but that's my immediate reaction, for what it's worth.

That could probably be made to work. I still have some concerns about
the fundamentals of your concept, which I voice later in this mail...

> We're looking for the first duplicate. So it would probably be okay
> for the IGNORE case to not bother retrying and re-locking if the other
> transaction committed (which, at least very broadly speaking, is the
> likely outcome).

Hm. Could you explain this a bit more? Do you mean that we can stop
trying to insert after the first violation (yes, sure, if so)?

> > After some thinking I don't think any solution primarily based on
> > holding page level locks across other index operations is going to
> > scale ok.
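To make the "release everything, wait, retry across all unique indexes" idea concrete, here is a toy Python model. It is emphatically not PostgreSQL code: the index, lock, and wait primitives (`FakeIndex`, `FakeLock`, `wait_for_xact`) are invented stand-ins for the buffer-lock and transaction-wait machinery, just to show the control flow.

```python
# Toy model of "release all locks, wait, retry from scratch" speculative
# value locking across every unique index.  All names are illustrative.

class FakeLock:
    def __init__(self):
        self.released = False
    def release(self):
        self.released = True

class FakeIndex:
    def __init__(self, column, duplicates=(), in_progress=None):
        self.column = column
        self.duplicates = set(duplicates)           # committed duplicates
        self.in_progress = dict(in_progress or {})  # value -> blocking xid
    def try_lock_value(self, value):
        """Returns ('dup', None), ('wait', xid) or ('ok', lock)."""
        if value in self.duplicates:
            return ('dup', None)
        if value in self.in_progress:
            # Model the blocker finishing once we wait on it.
            return ('wait', self.in_progress.pop(value))
        return ('ok', FakeLock())

def speculative_lock_all(indexes, row, wait_for_xact):
    """Value-lock every unique index before heap insertion; return the
    held locks, or None if a committed duplicate exists (IGNORE case)."""
    while True:
        held = []
        restart = False
        for idx in indexes:
            state, payload = idx.try_lock_value(row[idx.column])
            if state == 'dup':
                for lock in held:
                    lock.release()
                return None
            if state == 'wait':
                # Crucially: release *all* buffer locks before sleeping,
                # then retry from the first index, not just this one.
                for lock in held:
                    lock.release()
                wait_for_xact(payload)
                restart = True
                break
            held.append(payload)
        if not restart:
            return held
```

The point of the model is the `restart` path: no buffer lock is held while sleeping on another transaction, at the price of redoing the earlier index descents.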
> Just to be clear: I'm certainly not suggesting that it's okay to have
> exclusive buffer locks held for any amount of time longer than
> instantaneous. Furthermore, it isn't necessary to do anything
> different to what we currently do with non-unique btree indexes during
> speculative insertion - I don't treat them any differently, and they
> only get locked in the usual way at the usual time, after heap
> insertion. Much of the time, there'll only be a primary key index to
> lock, and you'll have increased the window of the exclusive write-lock
> (compared to the existing code when doing a regular insertion) by only
> a very small amount - the duration of heap tuple insertion (the
> baseline here is the duration of finding an insertion location on page
> for index tuple, tuple insertion + possible page split). We're talking
> about a write-lock on a btree leaf page, which is only going to block
> other sessions from *finishing off* tuple insertion, and then only for
> a minority of such insertions much of the time.

I can't really agree with this part. First off, a heap_insert()
certainly isn't a lightweight operation. There are several things that
can take extended time:
* toasting of rows (writing a gigabyte worth of data...)
* extension of the heap/toast table
* writeout of a dirty victim buffer if the page isn't in s_b
* reading in the target page if the page isn't in s_b
* waiting for an exclusive lock on the target pages

Secondly, you seem to be saying that such a lock would only block other
sessions from finishing off tuple insertions - but it will block normal
index searches as well. I have a hard time accepting that one session
using INSERT ... KEY IGNORE will block lookups of a range of values
from the index in *other* sessions.

> The question must be asked: if you don't think this will scale okay,
> what are you comparing it to? The most useful basis of comparison is
> likely to be other systems.
> Asking about uniqueness across many unique
> indexes, and getting them all to commit to their unanimous position
> for a moment in time is inherently involved - I don't see a way to do
> it other than lock values, and I don't see any granularity at which
> that's possible other than the btree page granularity.

I'd expect that with the current implementation strategy, if you e.g.
made pgbench use INSERT ... ON DUPLICATE IGNORE without ever actually
hitting duplicates, you'd see a very, very noticeable hit in
scalability. I wouldn't expect a single session to get much slower, but
trying a dozen or two should show a very noticeable dip.

> > What I had in mind when playing around with this is the following trick:
> > Just as in your patch, split index insertion into two phases and perform
> > one of them before inserting the heap tuple. In what you called
> > "speculative insertion", don't stop at locking the page, but actually
> > insert an index tuple in the index. As we haven't yet inserted the heap
> > tuple we obviously can't insert an actual tuple. Instead set a flag on
> > the item and reuse the item pointer to store the xid of the
> > inserting process. I coined those items "insertion promises"...
>
> I did consider that sort of approach from an early stage - Simon
> suggested it to me privately early last year. Actually, I even pointed
> out in the 2012 developer meeting that there were unused IndexTuple
> flag bits available for such purposes.

Yea, I am not trying to say that I am the first one to have had that
idea, just that that's what I had investigated.

> > Index searches will ignore promises they see, but concurrent index
> > insertions will notice it's a promise and wait for the xid (or not, if
> > it has aborted).
>
> If inserters must block, they'll have to restart - I'm sure you'll
> agree that they can't sit on even a shared/read buffer lock
> indefinitely.
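A toy sketch of the promise scheme may help. This is illustrative Python, not nbtree code: a real promise would live in the index item pointer as described above, and `wait_for_xact` stands in for the transaction-wait machinery (returning True if that transaction committed).

```python
# Toy model of "insertion promises": before the heap tuple exists, the
# index holds a flagged item carrying the inserter's xid.  Plain
# searches skip promises; concurrent unique inserters wait on the xid.

class ToyUniqueIndex:
    def __init__(self):
        self.items = {}  # key -> ("promise", xid) or ("tuple", heap_tid)

    def insert_promise(self, key, xid, wait_for_xact):
        """Place a promise; returns False on a duplicate, in which case
        the caller IGNOREs the insert."""
        while True:
            existing = self.items.get(key)
            if existing is None:
                self.items[key] = ("promise", xid)
                return True
            kind, payload = existing
            if kind == "tuple":
                return False              # committed duplicate
            # Concurrent promise: wait on its xid, as _bt_doinsert()
            # already waits for plain unique insertions - but with no
            # heap lookup needed to find the xid.
            if wait_for_xact(payload):    # True => that xact committed
                return False
            del self.items[key]           # it aborted; slot is free again

    def fulfil_promise(self, key, heap_tid):
        """After heap insertion, repoint the item at the real tuple."""
        self.items[key] = ("tuple", heap_tid)

    def search(self, key):
        """Ordinary searches want heap tuples; a promise has none yet,
        so it is ignored."""
        item = self.items.get(key)
        if item is None or item[0] == "promise":
            return None
        return item[1]
```

The interesting properties are visible even in the toy: `search()` never returns a promise, and a second inserter blocks on the promiser's xid rather than on a buffer lock.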
We already have the logic for waiting in _bt_doinsert() - we can just
reuse the logic already there for plain unique insertions. Except that
we would know the xid to wait on without doing actual heap lookups (in
_bt_check_unique()).

> So, have you really saved anything? Especially since
> you'll have to search at least twice (more shared locks on *root*
> pages) where my scheme only needs to search once.

I think we could optimize away the second search with some cleverness.
If we remember & pin the page from the first search, we know that the
promise will have to be either on that page or - after a page split -
potentially on ones to its right, which we should be able to find by
repeatedly stepping right. Similar to the logic of doing the
_bt_moveright() in _bt_doinsert().

> Also, are you really sure that index searches should have license to
> ignore these promises? What about dirty snapshots?

Well, normal index searches are looking for actual heap tuples, which
do not yet exist at that point, so we cannot return anything for those.
And given that all current logic using dirty snapshots works with us
inserting the index tuples only after inserting heap tuples, I cannot
see this causing distress there. Obviously we cannot ignore the
promises from inside the btree insertion code, since their being
visible there is their raison d'être...

> The question I should really ask about your scheme is the same
> question you've asked of mine: What about multiple unique indexes on
> the same table? Complexity issues aside, is it really better to get
> exclusive/write locks on each of them twice (albeit shorter-held
> ones), and approximately double the number of shared locks too, just
> to avoid concurrently held index tuple exclusive locks?

I am pretty sure it would be better. We know that nbtree works pretty
damn well under concurrency, and the "promise" logic wouldn't change
anything about that.
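The re-find optimization can be sketched as follows. Pages here are simplified to key-sets with a right-sibling link; the invariant relied on is the usual btree one, that a split only ever moves items to the right.

```python
# Toy sketch of avoiding the second tree descent: keep the leaf page
# from the first search pinned; after splits, the promise can only have
# moved rightward, so step through right-sibling links, much like
# _bt_moveright().  Illustrative only, not nbtree code.

class ToyLeafPage:
    def __init__(self, keys):
        self.keys = set(keys)
        self.right = None  # right-sibling pointer

def refind_promise(pinned_page, key):
    """Walk right from the remembered page until the key turns up."""
    page = pinned_page
    while page is not None:
        if key in page.keys:
            return page
        page = page.right
    return None  # shouldn't happen if the promise was really inserted
```

So in the common no-split case this is a single hash/binary-search on the already-pinned page, with the step-right loop only paying off work when a split actually intervened.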
Sure, the ON DUPLICATE IGNORE inserter will have a slightly higher
overhead because it has to navigate to wherever the promise has been
split to and then modify the promise to point to an on-disk tuple
instead of the xid. But it won't have to do splits a second time or
anything like that, and it will just need to lock one page at a time.

Furthermore, I am pretty sure that locking an index page *and* related
heap pages exclusively at the same time will be a deadlock nightmare,
even in the case of only having a single unique index.

> Now, I guess you probably could make all this work - somehow - but
> that's even more complexity and even more cycles for vanishingly small
> gain in one specific area. And frankly I don't think there was much
> gain to begin with. You're glossing over too many details about what
> your scheme would need to do to make a fair comparison with mine. It
> seems like you're really concerned about scalability in the sense of
> scaling the number of unique indexes on one table when there is lots
> of concurrent activity on the same range of values across multiple
> unique indexes.

No, I am primarily concerned about the performance of concurrent
inserts and concurrent lookups, even in the case of a single unique
index. I don't think we can disregard multiple unique indexes - they
are not all that uncommon - but I think we'll see problems before that.

> With my patch, and a single primary key unique index,
> the overhead of this extended window of locking is very small
> indeed - because we usually only walk the tree and lock once, the
> overhead relative to regular insertion is fixed. 2 or 3 unique indexes
> aren't much worse in practice, I suspect.

Maybe I am missing something, but I can't see how that could be the
case.

> > Imo that solution is fairly elegant because it doesn't cause any heap
> > bloat, and it causes the same amount of index bloat
> > "insert-into-heap-first" type of solutions cause.
> I don't think that's a reasonable comparison. Bloating indexes with
> dead *duplicate* tuples is, in general, worse than doing the same with
> the heap. If you end up with a heavily bloated index with lots of
> duplicates, that will adversely affect performance worse than with
> non-duplicate index tuples (see [1] for example). That was one reason
> that HOT helped as much as it did, I believe.

But we wouldn't ever insert promises when there is a duplicate in the
index? We obviously need to check for existing unique violations before
inserting the promise. Also, unless we abort with a fatal error, we can
relatively easily mark all those promises as dead when the insertion
doesn't succeed due to other indexes failing the uniqueness check (that
won't merge pages though).

> > It also has pretty
> > good concurrency behaviour because you never need to retain a page lock
> > for longer than a normal index insertion.
> > It's certainly a bit more complex than your solution tho...
>
> It's a lot more complex.

No denying there, although I think your scheme will also grow a good
bit more complex ;). Please don't get me wrong, I'd be very happy if
your scheme can be made to work, it certainly has advantages. I just
don't see how, which is why I am explaining the problems I see and what
solution *I* can see. But then, unsurprisingly, I have thought more
about my way of doing things, so I might completely miss some solutions
to your idea.

Also, I think, even if we were to switch schemes, about 70% of your
patch could be used with barely any changes. It's mostly the btree
changes that would have to change.

> You have to write an index tuple, and WAL-log
> that, just to do nothing. You may also need to WAL-log "insertion
> proper", which in your scheme involves updating the index in-place.

You certainly would need to log that, yes.
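A minimal sketch of the "mark promises dead on a later failure" cleanup, under an invented representation (each index is just a dict from key to a tagged item, as nothing here is real nbtree code):

```python
# When a later unique index fails its check and we aren't aborting with
# a fatal error, walk the indexes that already hold our promise and tag
# each promise as dead so scans and inserters can discard it.  Pages are
# not merged here, matching the "won't merge pages" caveat above.

def kill_promises(indexes, keys):
    """indexes[i] maps key -> ("promise", xid) or ("tuple", tid);
    keys[i] is the value we promised in indexes[i]."""
    for index, key in zip(indexes, keys):
        item = index.get(key)
        if item is not None and item[0] == "promise":
            index[key] = ("dead", item[1])  # reclaimable, like a killed item
```

Committed tuples are left untouched; only our own not-yet-fulfilled promises are flagged.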
> Having said all that, I'm perfectly willing to talk about the
> scalability aspects of what I've done when I produce a revision that
> fixes that bug. It would certainly help your argument here if you had
> a benchmark, since it is my understanding that your concern is
> entirely about performance/scalability.

Yes. And about deadlocks due to holding the exclusive lock for a
prolonged time.

> Does the patch need logic about looking to the primary key for
> duplicates first? Or maybe, if it's a SERIAL column, to look there
> last since it's unlikely to have a duplicate and it might hurt more to
> lock a leaf page there? It doesn't really matter now, but for upsert
> we're going to want that. For upsert, which row to update may be
> ambiguous, so it makes sense to prioritise indexes for checking if
> only the first duplicate found will be updated in respect of a single
> inserted tuple. This kind of thing also helps performance.

Yes, I wondered whether we could sort the indexes to move the ones with
the highest conflict ratio up front. But I am afraid that doing so with
a scheme that involves keeping exclusive locks across multiple other
operations would result in deadlocks.

We might want to keep unique violation stats per index to do this
efficiently. I think keeping those would be a good idea anyway.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services