Thread: Lots of stuck queries after upgrade to 9.4


From
Spiros Ioannou
Date:
We upgraded to 9.4.4 from 8.4. We have tried 9.3.x in the past with success.
Unfortunately, 9.4 gets stuck 2-3 times a week.

Lots of queries seem stuck; that is, they show waiting: false. The queries can be COMMIT or INSERT.
No blocking locks seem to exist.
strace reveals that all stuck processes are in semop:
semop(2064417, {{0, -1, 0}}, 1^C <unfinished ...>

ipcs reveals no shortage of semaphores.

Any ideas are very welcome; our service is now in a very sad state.


Spiros Ioannou 
IT Manager, inAccess
www.inaccess.com

M: +30 6973-903808
T: +30 210-6802-358

Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
The OS is Debian 6.

Re: Lots of stuck queries after upgrade to 9.4

From
Tom Lane
Date:
Spiros Ioannou <sivann@inaccess.com> writes:
> We upgraded to 9.4.4 from 8.4 We have tried 9.3.x in the past with success.
> Unfortunately in 9.4 gets stuck 2-3 times a week.

> Lots of queries seem stuck, that is they have waiting:false. Queries can be
> COMMIT or INSERT.
> No blocking locks seem to exist.
> strace reveals that all stuck processes they are in semop
> semop(2064417, {{0, -1, 0}}, 1^C <unfinished ...>

Stack traces of the stuck processes might help determine what's going on.

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

Please try to capture traces for both the COMMIT and INSERT cases,
as they're almost certainly different.

            regards, tom lane


Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
Hi Tom,
thank you for your input. The DB was stuck again; I attach all logs and stack traces.

A stack trace from a COMMIT, an INSERT, an UPDATE, the WAL writer, the writer, and a sequence.

Stracing the COMMIT showed it stuck at: semop(3145761, {{12, -1, 0}}, 1

Please tell me what else I could do to help with the debugging.

Is there a way to downgrade PostgreSQL?

 




Attachment

Re: Lots of stuck queries after upgrade to 9.4

From
Alvaro Herrera
Date:
Spiros Ioannou wrote:
> Hi Tom,
> thank you for your input. The DB was stuck again, I attach all logs and
> stack traces.
>
> A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
> writer, and a sequence.
>
> Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1

Hmm, isn't this an indication of problems in the WAL insertion logic?
Looks to me like there are deadlocks using lwlocks.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Lots of stuck queries after upgrade to 9.4

From
Andres Freund
Date:
On 2015-07-20 13:06:51 +0200, Alvaro Herrera wrote:
> Spiros Ioannou wrote:
> > Hi Tom,
> > thank you for your input. The DB was stuck again, I attach all logs and
> > stack traces.
> >
> > A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
> > writer, and a sequence.
> >
> > Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1
>
> Hmm, isn't this an indication of problems in the WAL insertion logic?
> Looks to me like there are deadlocks using lwlocks.

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

Spiros, I guess you have commit_delay enabled? If so, does disabling it
"fix" the issue?

Regards,

Andres


Re: Lots of stuck queries after upgrade to 9.4

From
Andres Freund
Date:
On 2015-07-20 13:22:42 +0200, Andres Freund wrote:
> Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
> XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2  0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3  0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4  0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1  0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
#2  0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3  LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4  0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5  0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6  0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7  XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre



Re: Lots of stuck queries after upgrade to 9.4

From
Andres Freund
Date:
Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:
> On 2015-07-20 13:22:42 +0200, Andres Freund wrote:
> > Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
> > XLogFlush().
>
> These are the relevant stack traces:
> db9lock/debuglog-commit.txt
> #2  0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
> #3  0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
> #4  0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849
>
> db9lock/debuglog-insert-8276.txt
> #1  0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
> #2  0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
> #3  LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
> #4  0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
> #5  0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
> #6  0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
> #7  XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre


XLogFlush() has the following comment:
            /*
             * Re-check how far we can now flush the WAL. It's generally not
             * safe to call WaitXLogInsertionsToFinish while holding
             * WALWriteLock, because an in-progress insertion might need to
             * also grab WALWriteLock to make progress. But we know that all
             * the insertions up to insertpos have already finished, because
             * that's what the earlier WaitXLogInsertionsToFinish() returned.
             * We're only calling it again to allow insertpos to be moved
             * further forward, not to actually wait for anyone.
             */
            insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's an exclusive locker and some backend hasn't yet set
initializedUpto. Which seems like a possible state?
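
The blocking condition described above, sketched as a toy model (not the real lwlock.c code; the names and the boolean rule are simplifications of the behaviour under discussion):

```python
# Toy model of the LWLockWaitForVar() blocking rule discussed above.
# In the real code the protected variable is the holder's advertised
# "insertingAt" WAL position; InvalidXLogRecPtr plays the role of 0 here.

INVALID_XLOG_REC_PTR = 0

def would_block(lock_held_exclusive, inserting_at, oldval):
    """LWLockWaitForVar() sleeps while the lock is held exclusively and
    the protected variable still equals the caller's oldval."""
    return lock_held_exclusive and inserting_at == oldval

# WaitXLogInsertionsToFinish() passes oldval = InvalidXLogRecPtr, so a
# lock holder that has not yet advertised a position blocks the waiter:
assert would_block(True, INVALID_XLOG_REC_PTR, INVALID_XLOG_REC_PTR)
# Once the holder advertises a real position, the waiter can proceed:
assert not would_block(True, 2000, INVALID_XLOG_REC_PTR)
# And a released lock never blocks:
assert not would_block(False, INVALID_XLOG_REC_PTR, INVALID_XLOG_REC_PTR)
```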


Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
On 07/20/2015 03:01 PM, Andres Freund wrote:
> Heikki,
>
> On 2015-07-20 13:27:12 +0200, Andres Freund wrote:
>> On 2015-07-20 13:22:42 +0200, Andres Freund wrote:
>>> Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
>>> XLogFlush().
>>
>> These are the relevant stack traces:
>> db9lock/debuglog-commit.txt
>> #2  0x00007f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
>> #3  0x00007f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
>> #4  0x00007f7405a0e1d3 in XLogFlush (record=121713318911056) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849
>>
>> db9lock/debuglog-insert-8276.txt
>> #1  0x00007f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 '\000') at pg_sema.c:421
>> #2  0x00007f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
>> #3  LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
>> #4  0x00007f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, opportunistic=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
>> #5  0x00007f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
>> #6  0x00007f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, StartPos=<optimized out>, rdata=0x7ffff1c21b90, isLogSwitch=<optimized out>, write_len=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
>> #7  XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre
>
>
> XLogFlush() has the following comment:
>             /*
>              * Re-check how far we can now flush the WAL. It's generally not
>              * safe to call WaitXLogInsertionsToFinish while holding
>              * WALWriteLock, because an in-progress insertion might need to
>              * also grab WALWriteLock to make progress. But we know that all
>              * the insertions up to insertpos have already finished, because
>              * that's what the earlier WaitXLogInsertionsToFinish() returned.
>              * We're only calling it again to allow insertpos to be moved
>              * further forward, not to actually wait for anyone.
>              */
>             insertpos = WaitXLogInsertionsToFinish(insertpos);
>
> but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
> calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
> there's a exlusive locker and some backend doesn't yet have set
> initializedUpto. Which seems like a ossible state?

A backend always updates its insert position before sleeping/acquiring
another lock, by calling WALInsertLockUpdateInsertingAt. So even though
another backend might indeed be in the
initializedUpto==InvalidXlogRecPtr state, it will get out of that
state, either by releasing the lock or by updating initializedUpto,
before it does anything that might deadlock.

Clearly there's *something* wrong here, though, given the bug report...

- Heikki



Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
Happened again; here is another backtrace from a COMMIT process. I changed commit_delay to 0 (it was 4000, to help with our storage) and will report back.

Already logging to debuglog.txt.
#0  0x00007f47b0789ec7 in semop () from /lib/libc.so.6
#1  0x00007f47b2513d91 in PGSemaphoreLock (sema=0x7f47abfe7ed0, interruptOK=0 '\000') at pg_sema.c:421
#2  0x00007f47b25704f4 in LWLockWaitForVar (l=0x7f449f452680, valptr=0x7f449f4526a0, oldval=<optimized out>, newval=0xffffffffffffffff) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3  0x00007f47b23a93e6 in WaitXLogInsertionsToFinish (upto=121780827906336) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4  0x00007f47b23aa1d3 in XLogFlush (record=121780827898328) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849
#5  0x00007f47b2398f17 in RecordTransactionCommit () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:1196
#6  0x00007f47b239921b in CommitTransaction () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:1953
#7  0x00007f47b239a8e5 in CommitTransactionCommand () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xact.c:2732
#8  0x00007f47b257c105 in finish_xact_command () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:2437
#9  0x00007f47b2580a10 in exec_execute_message (max_rows=<optimized out>, portal_name=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:1974
#10 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x7f47b49b5730 "ifms_db", username=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/tcop/postgres.c:4137
#11 0x00007f47b25280f5 in BackendRun (port=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:4164
#12 BackendStartup (port=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:3829
#13 ServerLoop () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:1597
#14 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:1244
#15 0x00007f47b24ba474 in main (argc=5, argv=0x7f47b49b4920) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/main/main.c:228
Detaching from program: /usr/lib/postgresql/9.4/bin/postgres, process 15601


 





Re: Lots of stuck queries after upgrade to 9.4

From
Andres Freund
Date:
Hi,

On 2015-07-20 15:58:33 +0300, Spiros Ioannou wrote:
> Happened again, another backtrace from a COMMIT process. I changed the
> commit_delay to 0 (it was 4000 to help with our storage) and will report
> back.

What hardware & OS is this happening on?

Regards,

Andres


Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
Debian 6, x64, HP Proliant, dbsize: 3TB on EMC AX-4i (iscsi), 16 cores, 24GB RAM.
kernel: Linux vserver11 2.6.32-5-xen-amd64 #1 SMP Sun Dec 7 22:52:42 UTC 2014 x86_64 GNU/Linux
I attach postgresql.conf



 




Attachment

Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
FYI, we have a 9.3.5 with commit_delay = 4000 and commit_siblings = 5 and an 8TB dataset, which seems fine. (It runs on different, faster hardware though.)

 




Re: Lots of stuck queries after upgrade to 9.4

From
Andres Freund
Date:
On 2015-07-20 17:00:52 +0300, Spiros Ioannou wrote:
> FYI we have an 9.3.5 with commit_delay = 4000 and commit_siblings = 5 with
> a 8TB dataset which seems fine. (Runs on different - faster hardware
> though).

9.4 has a different xlog insertion algorithm (scaling much better), so
that unfortunately doesn't say very much...


Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
Well, so far with commit_delay=0, no problems. I will of course report back if something happens, but I believe the problem may indeed be solved/masked by that setting.

Rough description of our setup, or how to reproduce:
* Time-series data in a table, say "measurements"; size 3-4 TB, about 1000 inserts/second.
* Table measurements also has an insert trigger that copies each row into measurements_a (for daily export purposes).
Just the above would cause a stuck query after a few days.

Now, for exporting, we run the following CTE query (measurements_b is an empty table, measurements_a holds about 5 GB):
* WITH d_rows AS (DELETE FROM measurement_events_a RETURNING *) INSERT INTO measurement_events_b SELECT * FROM d_rows;
The above caused the problem to appear every time, after 10-20 minutes.

Regards,
-Spiros








 




Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
On 07/23/2015 11:31 AM, Spiros Ioannou wrote:
> Well, so far with commit_delay=0 no problems. I will report back of couse
> if something happens, but I believe that the problem may indeed be
> solved/masked with that setting.
>
> Rough description of our setup, or how to reproduce:
> * Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
> inserts/second
> * table measurements also has a trigger on insert to also insert on
> measurements_a (for daily export purposes)
> Just the above would cause a stuck query after a few days.
>
> Now for exporting we run the following CTE query (measurements_b is an
> empty table, measurements_a has about 5GB)
> * WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * )  INSERT
> INTO measurement_events_b SELECT * FROM d_rows;
> The above caused the problem to appear every time, after a 10-20 minutes.

I was able to reproduce something like this with pgbench, by running a
custom little module that calls WaitXLogInsertionsToFinish() in a
tight loop and checks that the value it returns moves monotonically
forward. With commit_delay on, once every minute or so, it moves backwards.
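
The probe itself was a custom C module, but the monotonicity check it performed amounts to the following sketch (positions as plain integers):

```python
# Sketch of the monotonicity check described above: scan the sequence of
# values returned by WaitXLogInsertionsToFinish() and report the first
# index where the position moves backwards.

def first_regression(positions):
    """Return the index of the first backwards step, or None if the
    sequence is monotonically non-decreasing."""
    for i in range(1, len(positions)):
        if positions[i] < positions[i - 1]:
            return i
    return None

assert first_regression([100, 150, 150, 200]) is None   # healthy sequence
assert first_regression([100, 2200, 2000, 2300]) == 2   # moved backwards
```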

I'll investigate why that happens...

- Heikki



Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
On 07/23/2015 02:36 PM, Heikki Linnakangas wrote:
> On 07/23/2015 11:31 AM, Spiros Ioannou wrote:
>> Well, so far with commit_delay=0 no problems. I will report back of couse
>> if something happens, but I believe that the problem may indeed be
>> solved/masked with that setting.
>>
>> Rough description of our setup, or how to reproduce:
>> * Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
>> inserts/second
>> * table measurements also has a trigger on insert to also insert on
>> measurements_a (for daily export purposes)
>> Just the above would cause a stuck query after a few days.
>>
>> Now for exporting we run the following CTE query (measurements_b is an
>> empty table, measurements_a has about 5GB)
>> * WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * )  INSERT
>> INTO measurement_events_b SELECT * FROM d_rows;
>> The above caused the problem to appear every time, after a 10-20 minutes.

Hmm. With that CTE query, were there other queries running at the same time?

> I was able to reproduce something like this with pgbench, by running a
> custom little module that calls the WaitXLogInsertionsToFinish() in a
> tight loop, and checks that the value it returns moves monotonically
> forward. With commit_delay on, once every minute or so, it moves backwards.
>
> I'll investigate why that happens...

I was able to debug the synthetic test case I created, but unfortunately
I don't think it explains the lockup you're seeing after all.

It's possible for WaitXLogInsertionsToFinish() to move backwards, in
this scenario:

1. Backend A acquires WALInsertLock 2, and reserves xlog between byte
positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend
A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte
positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in
cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet,
because it's a bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record,
and releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by A is now free.
It calculated 2200 as the return value, which was the latest reserved
WAL position. (Backend C started after it began, so it didn't have to
wait for it.)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position
pointing to the beginning of the page, 2000.

If you now call WaitXLogInsertionsToFinish() again, it will return
2000, because backend C is the only backend holding a lock, and its
advertised position is 2000. But the previous call calculated 2200.
GetXLogBuffer() always advertises a WAL position at the beginning of the
requested page, but that's a bit bogus.
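
Numerically, the scenario works out like this (a simplified model of the WaitXLogInsertionsToFinish() calculation, not the real xlog.c code; byte positions as plain integers):

```python
# Simplified model: the result starts at the latest reserved WAL
# position and is clamped down to the lowest position advertised by any
# current lock holder. (Holders that have not advertised anything would
# block the caller instead; only already-advertised holders are modeled.)

def wait_xlog_insertions_to_finish(reserved_upto, advertised_positions):
    finished_upto = reserved_upto
    for pos in advertised_positions:
        finished_upto = min(finished_upto, pos)
    return finished_upto

# Step 6: backend B computed its result from a snapshot taken before
# backend C reserved 2200-2300, so C is not taken into account:
first = wait_xlog_insertions_to_finish(2200, [])
# Step 7: C advertises the beginning of the page (2000); a second call
# now sees reservations up to 2300 but clamps to C's advertised 2000:
second = wait_xlog_insertions_to_finish(2300, [2000])
assert (first, second) == (2200, 2000)
assert second < first  # the "finished up to" position moved backwards
```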

However, AFAICS that is actually harmless. Backend C is not blocked. The
page it's looking for is certainly in cache at this point, so it can
continue without blocking. So I don't think this explains your lockup.



- Heikki



Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
> Hmm. With that CTE query, were there other queries running at the same time?

Yes, a multitude of INSERTs and one COMMIT, as I remember. I've noticed that in every stuck situation there was always one stuck COMMIT running. We do mainly INSERT and SELECT, very rarely UPDATE or DELETE.
But the stuck queries happen also without the CTE; the insert trigger runs non-stop, though.

I think I've noticed a different behaviour in 9.4 with the CTE, but I'm only 50% sure about it; it may or may not be relevant:
in 9.3, while the CTE was running, I think the trigger-filled table was locked and no inserts could be made to the primary table (which copies all inserts to the trigger table). In 9.4, inserts can still be made, at a lower rate, even while the CTE runs. As I said, I'm not 100% sure the behaviour was like that in 9.3.


I was able to debug the synthetic test case I created, but unfortunately I don't think it explains the lock up you're seeing after all.

It's possible for WaitXLogInsertionsToFinish() to move backwards, in this scenario:

1. Backend A acquires WALInsertLock 2, and reserves xlog between byte positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet, because it's a bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record, and releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by B is now free. It calculated 2200 as the return value, which was the latest reserved WAL position. (Backend C started after it began, so it didn't have to wait for it)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position pointing to the beginning of the page, 2000.

If you now call WaitXLogInsertionsToFinish() again, it will return 2000, because backend C is the only backend holding a lock, and its advertised position is 2000. But the previous call calculated 2200. GetXLogBuffer() always advertises a WAL position at the beginning of the requested page, and that's a bit bogus.

However, AFAICS that is actually harmless. Backend C is not blocked. The page it's looking for is certainly in cache at this point, so it can continue without blocking. So I don't think this explains your lockup.

Thanks for the investigation. The only thing I can offer is that it's still running fine with commit_delay = 0. Perhaps a warning should be added to the documentation until the cause is found.
thanks
-Spiros

Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
A-ha, I succeeded in reproducing this now on my laptop, with pgbench! It
seems to be important to have a very large number of connections:

pgbench -n -c400 -j4 -T600 -P5

That got stuck after a few minutes. I'm using commit_delay=100.

Now that I have something to work with, I'll investigate this more tomorrow.

- Heikki



Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
> A-ha, I succeeded in reproducing this now on my laptop, with pgbench! It
> seems to be important to have a very large number of connections:
>
> pgbench -n -c400 -j4 -T600 -P5
>
> That got stuck after a few minutes. I'm using commit_delay=100.
>
> Now that I have something to work with, I'll investigate this more tomorrow.

Ok, it seems that this is caused by the same issue that I found with my
synthetic test case, after all. It is possible to get a lockup because
of it.

For the archives, here's a hopefully easier-to-understand explanation of
how the lockup happens. It involves three backends. A and C are
inserting WAL records, while B is flushing the WAL with commit_delay.
The byte positions 2000, 2100, 2200, and 2300 are offsets within a WAL
page. 2000 points to the beginning of the page, while the others are
later positions on the same page. WaitToFinish() is an abbreviation for
WaitXLogInsertionsToFinish(). "Update pos X" means a call to
WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
ReserveXLogInsertLocation, which returned StartPos A and EndPos B.

Backend A               Backend B               Backend C
---------               ---------               ---------
Acquire InsertLock 2
Reserve 2100-2200
                        Calls WaitToFinish()
                          reservedUpto is 2200
                          sees that Lock 1 is
                          free
                                                Acquire InsertLock 1
                                                Reserve 2200-2300
                                                GetXLogBuffer(2200)
                                                 page not in cache
                                                 Update pos 2000
                                                 AdvanceXLInsertBuffer()
                                                  run until about to
                                                  acquire WALWriteLock
GetXLogBuffer(2100)
 page not in cache
 Update pos 2000
 AdvanceXLInsertBuffer()
  Acquire WALWriteLock
  write out old page
  initialize new page
  Release WALWriteLock
finishes insertion
release InsertLock 2
                        WaitToFinish() continues
                          sees that lock 2 is
                          free. Returns 2200.

                        Acquire WALWriteLock
                        Call WaitToFinish(2200)
                          blocks on Lock 1,
                          whose insertingAt
                          is 2000.

At this point, there is a deadlock between B and C. B is waiting for C
to release its insertion lock or update its insertingAt value past 2200,
while C is waiting for WALWriteLock, held by B.

To fix that, let's change GetXLogBuffer() to always advertise the exact
position, not the beginning of the page (except when inserting the first
record on the page, just after the page header; see comments).

This fixes the problem for me. I've been running pgbench for about 30
minutes without lockups now, while without the patch it locked up within
a couple of minutes. Spiros, can you easily test this patch in your
environment? Would be nice to get a confirmation that this fixes the
problem for you too.

- Heikki


Attachment

Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
I'm very sorry, but we don't have a synthetic load generator for our testing setup, only production, and that is under SLA. I would be happy to test the next release, though.

 

Spiros Ioannou 
IT Manager, inAccess
www.inaccess.com

M: +30 6973-903808
T: +30 210-6802-358


On 29 July 2015 at 13:42, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
[Heikki's full message of 29 July, quoted verbatim here in the original email, trimmed; see above.]


Re: Lots of stuck queries after upgrade to 9.4

From
Scott Marlowe
Date:
You might want to try pgreplay: http://laurenz.github.io/pgreplay/

On Thu, Jul 30, 2015 at 7:23 AM, Spiros Ioannou <sivann@inaccess.com> wrote:
I'm very sorry but we don't have a synthetic load generator for our testing setup, only production and that is on SLA. I would be happy to test the next release though.

 

[Quoted signature and earlier quoted messages trimmed.]





--
To understand recursion, one must first understand recursion.

Re: Lots of stuck queries after upgrade to 9.4

From
Spiros Ioannou
Date:
That would probably need a replica of the data, which is not possible (the tablespace is 4 TB).

 

Spiros Ioannou 
IT Manager, inAccess
www.inaccess.com

M: +30 6973-903808
T: +30 210-6802-358


On 30 July 2015 at 21:47, Scott Marlowe <scott.marlowe@gmail.com> wrote:
You might want to try pg replay: http://laurenz.github.io/pgreplay/

[Earlier quoted messages trimmed.]

Re: Lots of stuck queries after upgrade to 9.4

From
Tatsuo Ishii
Date:
Heikki,

> A-ha, I succeeded in reproducing this now on my laptop, with pgbench! It
> seems to be important to have a very large number of connections:
>
> pgbench -n -c400 -j4 -T600 -P5
>
> That got stuck after a few minutes. I'm using commit_delay=100.

Since the original reporter seems to have difficulty executing your test
case right away, I thought I would try it myself.

Unfortunately, however, I have not succeeded in reproducing the stuck-query
problem using 9.4.4. I'm considering increasing commit_delay and/or
pgbench's -c and trying again. Is there anything else I should try?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Re: Lots of stuck queries after upgrade to 9.4

From
Heikki Linnakangas
Date:
On 07/30/2015 04:23 PM, Spiros Ioannou wrote:
> I'm very sorry but we don't have a synthetic load generator for our testing
> setup, only production and that is on SLA. I would be happy to test the
> next release though.

Ok, no worries. I've pushed the fix; it will appear in the next
release. Thanks for the report!

- Heikki