Re: Lots of stuck queries after upgrade to 9.4 - Mailing list pgsql-general
From | Heikki Linnakangas |
---|---|
Subject | Re: Lots of stuck queries after upgrade to 9.4 |
Date | |
Msg-id | 55B69A02.1050009@iki.fi Whole thread Raw |
In response to | Re: Lots of stuck queries after upgrade to 9.4 (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Lots of stuck queries after upgrade to 9.4
|
List | pgsql-general |
On 07/23/2015 02:36 PM, Heikki Linnakangas wrote: > On 07/23/2015 11:31 AM, Spiros Ioannou wrote: >> Well, so far with commit_delay=0 no problems. I will report back of couse >> if something happens, but I believe that the problem may indeed be >> solved/masked with that setting. >> >> Rough description of our setup, or how to reproduce: >> * Timeseries data in table , say, "measurements", size: 3-4TB, about 1000 >> inserts/second >> * table measurements also has a trigger on insert to also insert on >> measurements_a (for daily export purposes) >> Just the above would cause a stuck query after a few days. >> >> Now for exporting we run the following CTE query (measurements_b is an >> empty table, measurements_a has about 5GB) >> * WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * ) INSERT >> INTO measurement_events_b SELECT * FROM d_rows; >> The above caused the problem to appear every time, after a 10-20 minutes. Hmm. With that CTE query, were there other queries running at the same time? > I was able to reproduce something like this with pgbench, by running a > custom little module that calls the WaitXLogInsertionsToFinish() in a > tight loop, and checks that the value it returns moves monotonically > forward. With commit_delay on, once every minute or so, it moves backwards. > > I'll investigate why that happens... I was able to debug the synthetic test case I created, but unfortunately I don't think it explains the lock up you're seeing after all. It's possible for WaitXLogInsertionsToFinish() to move backwards, in this scenario: 1. Backend A acquires WALInsertLock 2, and reserves xlog between byte positions 2100 - 2200 2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend A, which hasn't advertised any location yet. 3. Backend C acquires WALInsertLock 1, and reserves xlog between byte positions 2200 - 2300 4. Backend C calls GetXLogBuffer(), and sees that the page is not in cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet, because it's a bit slow or context-switched out) 5. Backend A initializes the page, completes inserting its WAL record, and releases its WALInsertLock. 6. Backend B gets unblocked, seeing that the lock held by B is now free. It calculated 2200 as the return value, which was the latest reserved WAL position. (Backend C started after it began, so it didn't have to wait for it) 7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position pointing to the beginning of the page, 2000. If you now call WALInsertLockUpdateInsertingAt() again, it will return 2000, because backend C is the only backend holding a lock, and its advertised position is 2000. But the previous call calculated 2200. GetXLogBuffer() always advertises a WAL position at the beginning of the requested page, but that's a bit bogus. However, AFAICS that is actually harmless. Backend C is not blocked. The page it's looking for is certainly in cache at this point, so it can continue without blocking. So I don't think this explains your lockup. - Heikki
pgsql-general by date: