Thread: BUG #16094: Database entering recovery mode

BUG #16094: Database entering recovery mode

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16094
Logged by:          Mircea Pirv
Email address:      mircea@reva.tech
PostgreSQL version: 12.0
Operating system:   OSX
Description:

Hello, 

We've updated our database from 10.6 to 12.0 recently and we keep
encountering an error which says that the database is entering recovery
mode.
Our setup has replication enabled, with a replication slot, and using a
replication subscription. On our development machines replication is setup
to use the same machine.
The problem is, that with replication enabled, changing a column value using
an update statement, causes the database to enter recovery mode.
This happens only when replication is enabled.
We could not find any pattern, like what kind of columns might be the
culprit. It happens with all kinds of columns, regardless of the fact that
we have constraints, triggers, defaults on the columns. With some columns it
happens randomly, with some it happens all the time.

In the logs the only error we see, is a segmentation fault, when replication
tries to run.
We tried with both a local server, and a dockerized version (12.0).

It only reproduces in 12.0. 10.6 and 11.5 do not have this problem, we could
not reproduce it.

Any ideas on what may have changed in 12.0 to cause this behavior ?

Thanks,
Mircea


Re: BUG #16094: Database entering recovery mode

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> We've updated our database from 10.6 to 12.0 recently and we keep
> encountering an error which says that the database is entering recovery
> mode.
> ...
> In the logs the only error we see, is a segmentation fault, when replication
> tries to run.

It should be possible to collect a stack trace for the segfault,
which would greatly assist debugging this.  Please see

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane



Re: BUG #16094: Database entering recovery mode

From
Mircea Pirv
Date:
This is all that I managed to retrieve using gdb 

#0  0x00007f53d89eb7b7 in epoll_wait (epfd=3, events=0x55bc137a59c0, maxevents=1, maxevents@entry=<error reading variable: Cannot access memory at address 0x7ffee2d44afc>, timeout=-1, timeout@entry=<error reading variable: Cannot access memory at address 0x7ffee2d44ae8>) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        resultvar = 18446744073709551612
        sc_ret = <optimized out>
#1  0x000055bc11790379 in WaitEventSetWaitBlock (nevents=<error reading variable: Cannot access memory at address 0x7ffee2d44afc>, occurred_events=<error reading variable: Cannot access memory at address 0x7ffee2d44ad8>, cur_timeout=<error reading variable: Cannot access memory at address 0x7ffee2d44ae8>, set=0x55bc137a5948) at ./build/../src/backend/storage/ipc/latch.c:1080
        returned_events = 0
        rc = <optimized out>
        cur_event = <optimized out>
        cur_epoll_event = <optimized out>
        returned_events = <optimized out>
        rc = <optimized out>
        cur_event = <optimized out>
        cur_epoll_event = <optimized out>
        __func__ = <error reading variable __func__ (Cannot access memory at address 0x55bc11a5a510)>
        __errno_location = <optimized out>
#2  WaitEventSetWait (set=0x55bc137a5948, timeout=<error reading variable: Cannot access memory at address 0x7ffee2d44ae0>, occurred_events=<error reading variable: Cannot access memory at address 0x7ffee2d44ad8>, nevents=<error reading variable: Cannot access memory at address 0x7ffee2d44afc>, wait_event_info=<optimized out>) at ./build/../src/backend/storage/ipc/latch.c:1032
        rc = <optimized out>
        returned_events = 0
        start_time = <error reading variable start_time (Cannot access memory at address 0x7ffee2d44b00)>
        cur_time = <error reading variable cur_time (Cannot access memory at address 0x7ffee2d44b10)>
        cur_timeout = <error reading variable cur_timeout (Cannot access memory at address 0x7ffee2d44ae8)>
Backtrace stopped: Cannot access memory at address 0x7ffee2d44b78

On Mon, Nov 4, 2019 at 5:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> We've updated our database from 10.6 to 12.0 recently and we keep
> encountering an error which says that the database is entering recovery
> mode.
> ...
> In the logs the only error we see, is a segmentation fault, when replication
> tries to run.

It should be possible to collect a stack trace for the segfault,
which would greatly assist debugging this.  Please see

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

                        regards, tom lane

Re: BUG #16094: Database entering recovery mode

From
Michael Paquier
Date:
On Wed, Nov 06, 2019 at 03:37:31PM +0200, Mircea Pirv wrote:
> #2  WaitEventSetWait (set=0x55bc137a5948, timeout=<error reading variable:
> Cannot access memory at address 0x7ffee2d44ae0>, occurred_events=<error
> reading variable: Cannot access memory at address 0x7ffee2d44ad8>,
> nevents=<error reading variable: Cannot access memory at address
> 0x7ffee2d44afc>, wait_event_info=<optimized out>) at
> ./build/../src/backend/storage/ipc/latch.c:1032
>         rc = <optimized out>
>         returned_events = 0
>         start_time = <error reading variable start_time (Cannot access
> memory at address 0x7ffee2d44b00)>
>         cur_time = <error reading variable cur_time (Cannot access memory
> at address 0x7ffee2d44b10)>
>         cur_timeout = <error reading variable cur_timeout (Cannot access
> memory at address 0x7ffee2d44ae8)>
> Backtrace stopped: Cannot access memory at address 0x7ffee2d44b78

Is this the only part of the stack you can get?  WaitEventSetWait()
gets called in three places, which are be-secure.c, syslogger.c and
condition_variable.c.  Could it be possible to see more of the actual
callers here?
--
Michael

Attachment

Re: BUG #16094: Database entering recovery mode

From
Mircea Pirv
Date:
Hi, Thanks for the reply. That's all the bt full command prints out when the segmentation fault error occurs.

Thanks.
Mircea

On Thu, Nov 7, 2019 at 4:35 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Nov 06, 2019 at 03:37:31PM +0200, Mircea Pirv wrote:
> #2  WaitEventSetWait (set=0x55bc137a5948, timeout=<error reading variable:
> Cannot access memory at address 0x7ffee2d44ae0>, occurred_events=<error
> reading variable: Cannot access memory at address 0x7ffee2d44ad8>,
> nevents=<error reading variable: Cannot access memory at address
> 0x7ffee2d44afc>, wait_event_info=<optimized out>) at
> ./build/../src/backend/storage/ipc/latch.c:1032
>         rc = <optimized out>
>         returned_events = 0
>         start_time = <error reading variable start_time (Cannot access
> memory at address 0x7ffee2d44b00)>
>         cur_time = <error reading variable cur_time (Cannot access memory
> at address 0x7ffee2d44b10)>
>         cur_timeout = <error reading variable cur_timeout (Cannot access
> memory at address 0x7ffee2d44ae8)>
> Backtrace stopped: Cannot access memory at address 0x7ffee2d44b78

Is this the only part of the stack you can get?  WaitEventSetWait()
gets called in three places, which are be-secure.c, syslogger.c and
condition_variable.c.  Could it be possible to see more of the actual
callers here?
--
Michael