Re: [HACKERS] Deadlock in XLogInsert at AIX - Mailing list pgsql-hackers

From Bernd Helmle
Subject Re: [HACKERS] Deadlock in XLogInsert at AIX
Date
Msg-id 1485786380.3084.2.camel@oopsware.de
In response to Re: [HACKERS] Deadlock in XLogInsert at AIX  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
Hi Konstantin,

We observed exactly the same issue on a customer system with the
same environment and PostgreSQL 9.5.5. Additionally, we tested on
Linux with XL/C 12 and 13 and saw exactly the same deadlock behavior.

So we assumed that this is somehow a compiler issue.

On Tuesday, 24.01.2017, at 19:26 +0300, Konstantin Knizhnik wrote:
> More information about the problem - Postgres log contains several
> records:
> 
> 2017-01-24 19:15:20.272 MSK [19270462] LOG:  request to flush past end of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0
> 
> and they correspond to the time when the deadlock happens.

Yeah, the same logs here:

LOG:  request to flush past end of generated WAL; request 1/1F4C6000, currpos 1/1F4C40E0
STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -2653 WHERE aid = 3662494;


> There is the following comment in xlog.c concerning this message:
> 
>      /*
>       * No-one should request to flush a piece of WAL that hasn't even been
>       * reserved yet. However, it can happen if there is a block with a bogus
>       * LSN on disk, for example. XLogFlush checks for that situation and
>       * complains, but only after the flush. Here we just assume that to mean
>       * that all WAL that has been reserved needs to be finished. In this
>       * corner-case, the return value can be smaller than 'upto' argument.
>       */
> 
> So looks like it should not happen.
> The first thing to suspect is spinlock implementation which is
> different for GCC and XLC.
> But ... if I rebuild Postgres without spinlocks, then the problem is
> still reproduced.
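
For context, the check that the quoted comment describes can be sketched
roughly as below. This is a simplified illustration under assumptions, not
the actual xlog.c code: XLogRecPtr is reduced to a plain 64-bit integer,
clamp_flush_request is a hypothetical helper name, and fprintf stands in
for the server's own logging.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the WAL position type (assumption for this sketch). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper: if a caller asks to flush past the end of the WAL
 * that has been reserved so far, complain and clamp the request to the
 * reserved position. The return value can therefore be smaller than 'upto'.
 */
static XLogRecPtr
clamp_flush_request(XLogRecPtr upto, XLogRecPtr reserved_upto)
{
    if (upto > reserved_upto)
    {
        /* Print both positions as high/low 32-bit halves, like the log line. */
        fprintf(stderr,
                "request to flush past end of generated WAL; "
                "request %X/%X, currpos %X/%X\n",
                (unsigned int) (upto >> 32), (unsigned int) upto,
                (unsigned int) (reserved_upto >> 32),
                (unsigned int) reserved_upto);
        upto = reserved_upto;
    }

    /* After clamping, the request never exceeds the reserved position. */
    assert(upto <= reserved_upto);
    return upto;
}
```

With the positions from the log above, a request for 6/AAEBE000 against a
current position of 6/AAEBC2B0 would take the clamping branch, which is why
the message should only show up when something hands in a bogus LSN.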

Before we got the results from XLC on Linux (where Postgres shows the
same behavior), I had a look at the spinlock implementation. If I got
it right, XLC doesn't use the ppc64-specific ones but the fallback
implementation (system monitoring on AIX also showed massive numbers of
calls to signal(0)...). So I tried the following patch:

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
new file mode 100644
index f901a0c..028cced
*** a/src/include/port/atomics/arch-ppc.h
--- b/src/include/port/atomics/arch-ppc.h
***************
*** 23,26 ****
--- 23,33 ----
  #define pg_memory_barrier_impl()      __asm__ __volatile__ ("sync" : : : "memory")
  #define pg_read_barrier_impl()                __asm__ __volatile__ ("lwsync" : : : "memory")
  #define pg_write_barrier_impl()               __asm__ __volatile__ ("lwsync" : : : "memory")
+
+ #elif defined(__IBMC__) || defined(__IBMCPP__)
+
+ #define pg_memory_barrier_impl()      __asm__ __volatile__ (" sync \n" ::: "memory")
+ #define pg_read_barrier_impl()                __asm__ __volatile__ (" lwsync \n" ::: "memory")
+ #define pg_write_barrier_impl()               __asm__ __volatile__ (" lwsync \n" ::: "memory")
+
  #endif

This didn't change the picture, though.
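
To double-check which preprocessor branch a given compiler would pick, a
small standalone probe along these lines might help. It is only a sketch:
the function names and the returned labels are made up for illustration,
and the assumption that the branch preceding the new #elif tests __GNUC__
is not visible in the hunk above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return a label for the branch the preprocessor selects. The macro tests
 * follow the patch; the order (GCC first) is an assumption of this sketch.
 */
static const char *
barrier_branch(void)
{
#if defined(__GNUC__)
    return "gcc";
#elif defined(__IBMC__) || defined(__IBMCPP__)
    return "xlc";
#else
    return "fallback";
#endif
}

/* Print the selected branch to the given stream. */
static void
report_barrier_branch(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "barrier branch: %s\n", barrier_branch());
}
```

Compiling this with each compiler in question and calling
report_barrier_branch(stdout) would show whether the XLC-specific
definitions are reachable at all or whether the build ends up in the
fallback.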



