Re: [HACKERS] recent deadlock regression test failures - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] recent deadlock regression test failures |
Date | |
Msg-id | 15694.1491674175@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [HACKERS] recent deadlock regression test failures (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses |
Re: [HACKERS] recent deadlock regression test failures
|
List | pgsql-hackers |
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Sat, Apr 8, 2017 at 4:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It seems an entirely principle-free change in the function's definition. > You might say that pg_blocking_pid() is about locking only and not > arbitrary other kinds of waits, but safe snapshots are not completely > unrelated to locking if you tilt your head at the right angle: > GetSafeSnapshot waits for all transactions that might hold SIRead > locks that could affect this transaction's serializability to > complete. Well, the problem I'm having is that once we say that pg_blocking_pids() is not just about heavyweight locks, it's not clear where we stop. There might be many other cases where, if you dig around in the right data structures, it'd be possible to say that process X is waiting for process Y to do something. How many of those will we expect pg_blocking_pids() to cover? And there certainly will be cases where X is waiting for Y but we don't have any effective way of identifying that. Is that a bug in pg_blocking_pids()? Admittedly, this is a somewhat academic objection; but I dislike taking a function with a fairly clear, published spec and turning it into something that does whatever's expedient for a single use-case. > Based on the above, here is a version that introduces a simple boolean > function pg_waiting_for_safe_snapshot(pid) and adds that the the > query. This was my "option 1" upthread. Nah, this is not good either. Yes, it's a fairly precise conversion of what Kevin's patch did, but I think that patch is wrong even without considering the performance angle. If you look back at the discussions in Feb 2016 that led to what we had, it turned out to be important not just to be able to say that process X is blocked, but that it is blocked by one of the other isolationtest sessions, and not by some random third party (like autovacuum). I do not know whether there is, right now, any way for autovacuum to be the guilty party for a SafeSnapshot wait --- but it does not seem like a good plan to assume that there never will be. So I think what we need is functionality very much like what you had in the prior patch, ie identify the specific PIDs that are causing process X to wait for a safe snapshot. I'm just not happy with how you packaged it. Here's a sketch of what I think we ought to do: 1. Leave pg_blocking_pids() alone; it's a published API now. 2. Add GetSafeSnapshotBlockingPids() more or less as you had it in the previous patch (+ Kevin's recommendations). There might be value in providing a SQL-level way to call that, but I'm not sure, and it would be independent of fixing isolationtester anyway. 3. Invent a SQL function that is dedicated specifically to supporting isolationtester and need not be documented at all; this gets us out of the problem of whether it's okay to whack its semantics around anytime somebody thinks of something else they want to test. I'm imagining an API like isolation_test_is_waiting_for(int, int[]) returns bool so that isolationtester's query would reduce to something like SELECT pg_catalog.isolation_test_is_waiting_for($1, '{...}') which would take even more cycles out of the parse/plan overhead for it (which is basically what's killing the CCA animals). Internally, this function would call pg_blocking_pids and, if necessary, GetSafeSnapshotBlockingPids, and would check for matches in its array argument directly; it could probably do that significantly faster than the general-purpose array && code. So we'd have to expend a bit of backend C code, but we'd have something that would be quite speedy and we could customize in future without fear of breaking behavior that users are depending on. regards, tom lane
pgsql-hackers by date: