I have found a case in nbtree where it is possible for two concurrent
transactions to insert the same key value in a unique index :-(
The reason is that _bt_check_unique can (of course) only detect keys
that are already in the index. To prevent concurrent insertion of
duplicate keys, we have to rely on locking. Two would-be inserters
of the same key must lock the same target page in _bt_doinsert, so
they cannot run the check at the same time. The second to obtain
the lock will see the first one's already-inserted key.
However, suppose that the index contains many existing instances
of the same key (presumably pointing at deleted-but-not-yet-vacuumed
tuples). If the existing instances span several index pages, it is
likely that _bt_insertonpg will decide to insert the new key on one
of the pages to the right of the original target key. As presently
coded, _bt_insertonpg drops the write lock it has before acquiring
write lock on the next page to the right. This creates a window
wherein another process could perform _bt_check_unique, fail to
find any non-dead instances of the key, and decide that it's okay
to insert the same key.
The fix is to acquire the next page's write lock before we drop the
current page's. This is deadlock-free since no writer ever tries
to chain write-locks to the left (in fact the same thing is done in
the page split logic).
I am not convinced that this explains any of the field reports of
duplicate rows that we've seen, but it's clearly a bug. Will commit
the fix shortly.
regards, tom lane