Re: Lots of stuck queries after upgrade to 9.4 - Mailing list pgsql-general

From Spiros Ioannou
Subject Re: Lots of stuck queries after upgrade to 9.4
Date
Msg-id CACKh8C9kz=iLncE4R_zR5qcXrREOSKeFWkO4i=d=J6yHd5ufMw@mail.gmail.com
Whole thread Raw
In response to Re: Lots of stuck queries after upgrade to 9.4  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-general
Hmm. With that CTE query, were there other queries running at the same time?

yes, a multitude of INSERT and one COMMIT as I remember. I've noticed that in every stuck situation there was always 1 stuck COMMIT running. We do mainly INSERT and SELECT, very rarely UPDATE or DELETE.
but the stuck queries happen also without the CTE. The trigger on insert runs non-stop though.

I think I've noticed a different behaviour with 9.4 and the CTE but I'm just 50% sure about that, but may be relevant or not:
in 9.3 while the CTE was running, I think the trigger-filled table  was locked and no inserts could be made to the primary table (which copies all inserts to the trigger table). In 9.4 inserts can be made at a lower rate even when CTE runs. As I said not 100% sure about the behaviour was like that in 9.3.


I was able to debug the synthetic test case I created, but unfortunately I don't think it explains the lock up you're seeing after all.

It's possible for WaitXLogInsertionsToFinish() to move backwards, in this scenario:

1. Backend A acquires WALInsertLock 2, and reserves xlog between byte positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet, because it's a bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record, and releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by B is now free. It calculated 2200 as the return value, which was the latest reserved WAL position. (Backend C started after it began, so it didn't have to wait for it)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position pointing to the beginning of the page, 2000.

If you now call WALInsertLockUpdateInsertingAt() again, it will return 2000, because backend C is the only backend holding a lock, and its advertised position is 2000. But the previous call calculated 2200. GetXLogBuffer() always advertises a WAL position at the beginning of the requested page, but that's a bit bogus.

However, AFAICS that is actually harmless. Backend C is not blocked. The page it's looking for is certainly in cache at this point, so it can continue without blocking. So I don't think this explains your lockup.

Thanks for the investigation. The only think I can offer is that it's still running fine with commit_delay = 0. Perhaps a warning should be put on the documentation until the cause is found.
thanks
-Spiros

pgsql-general by date:

Previous
From: 林士博
Date:
Subject: Re: Using the database to validate data
Next
From: Herouth Maoz
Date:
Subject: Re: *SOLVED* Connections closing due to "terminating connection due to administrator command"