Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From Robert Sanford
Subject Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors
Msg-id CABa+nRtZoekMo5yFcF_CpcaMShtaBjexr24-xcnFOBS2b4bnAA@mail.gmail.com
In response to BUG #18076: Consistently receiving Signal 7 and Signal 11 errors  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Note: one thing I read is that work_mem is a per-connection setting (and it can in fact be allocated more than once per query, per sort/hash operation), so 8GB * 32 active connections = far more RAM than the machine has.
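For reference, this is roughly how the setting can be lowered cluster-wide without a restart (a minimal sketch assuming superuser access from psql; work_mem is reloadable, so no restart is needed):

    ALTER SYSTEM SET work_mem = '128MB';  -- was 8GB; 8GB x 32 connections is far beyond 64GB of RAM
    SELECT pg_reload_conf();              -- work_mem only needs a reload, not a restart
    SHOW work_mem;                        -- confirm the new value in a fresh session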

I decreased it to 128MB and re-ran the tests. It got further than it had previously but failed in the same place. Then I decreased it to 32MB and the same thing happened.

Using the Azure VM monitor I can see that available memory has never dropped below 50%, so work_mem may not be the issue.
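If it helps with diagnosis, server-side memory usage of the backends running the function can also be dumped from SQL (a sketch assuming a superuser session and PostgreSQL 14+; the filter on the function name is only illustrative):

    -- Write each matching backend's memory context usage to the server log
    SELECT pg_log_backend_memory_contexts(pid)
    FROM pg_stat_activity
    WHERE query LIKE '%process_route_compliance%'
      AND pid <> pg_backend_pid();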

rjsjr

On Tue, Aug 29, 2023 at 3:31 PM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18076
Logged by:          Robert Sanford
Email address:      wobbet@wobbet.com
PostgreSQL version: 15.4
Operating system:   Linux 5.15.0-1042-azure #49~20.04.1-Ubuntu aarch64
Description:       

Installed v15 via Docker on my local dev machine (Intel x64). Created the
database schema, ran a bunch of tests. All very good.

I pg_dump'ed the database and imported it into a new instance running on an ARM64
VM in Azure (8 cores, 64 GB RAM, Ubuntu 20.04.1). This is a full VM, not a
Docker instance. According to Azure, the ARM64 CPUs are physical, not
virtual.

Logs from the Azure machine show Signal 7 and Signal 11 errors consistently
at the same point in the processing every time (log entries below). It happens
while executing a specific function. The server restarts and has to recover from
the WAL. None of the other functions show this behavior. Running the stored
procedure from DBeaver does not cause the same issues.

My ETL app is a .NET 6 console application running on my local machine
and connecting, via Npgsql, to either the DB on my local machine or the VM in
Azure. It doesn't seem to be horribly expensive in terms of CPU usage and
generally finishes in < 5s. That said, there can be as many as 32 simultaneous
connections running the specific stored procedure.
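In case it's useful, the level of concurrency can be confirmed on the server side with something like this (a sketch; the LIKE filter on the function name is only illustrative):

    -- Count backends currently executing the stored procedure
    SELECT count(*) AS active_calls
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query LIKE '%process_route_compliance%';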


After reading other bug reports involving Signal 7, I set huge_pages=off and
disabled features such as autoanalyze. Other non-default config entries (see the
query sketch after the list) include:
- max_connections = 100
- shared_buffers = 8GB
- work_mem = 8GB
- maintenance_work_mem = 1GB
- max_wal_size = 4GB
- min_wal_size = 500MB
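(A sketch of how the non-default settings can be listed straight from the server, in case I've missed any; excluding source = 'override' is an assumption to hide internal overrides:)

    -- List every setting that is not at its built-in default
    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE source NOT IN ('default', 'override');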

I am hopeful that someone will be able to help me out with this. Let me know
if there are any other details I need to provide.

Thanks!

rjsjr



Log entries...
2023-08-29 01:50:57.225 UTC [432307] LOG:  server process (PID 433701) was
terminated by signal 7: Bus error
2023-08-29 01:50:57.225 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:50:57.225 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:50:57.231 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:50:57.441 UTC [438947] LOG:  database system was interrupted;
last known up at 2023-08-29 01:44:22 UTC
2023-08-29 01:50:57.547 UTC [438947] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:50:57.553 UTC [438947] LOG:  redo starts at F/3CD6C068
2023-08-29 01:50:57.896 UTC [438950] FATAL:  the database system is not yet
accepting connections
2023-08-29 01:50:57.896 UTC [438950] DETAIL:  Consistent recovery state has
not been yet reached.
2023-08-29 01:50:58.612 UTC [438947] LOG:  unexpected pageaddr E/99864000 in
log segment 000000010000000F00000050, offset 8798208
2023-08-29 01:50:58.612 UTC [438947] LOG:  redo done at F/50863FB0 system
usage: CPU: user: 0.66 s, system: 0.32 s, elapsed: 1.05 s
2023-08-29 01:50:58.629 UTC [438948] LOG:  checkpoint starting:
end-of-recovery immediate wait


2023-08-29 01:51:07.996 UTC [438948] LOG:  checkpoint complete: wrote 35042
buffers (3.3%); 0 WAL file(s) added, 20 removed, 0 recycled; write=8.989 s,
sync=0.197 s, total=9.372 s; sync files=254, longest=0.193 s, average=0.001
s; distance=322528 kB, estimate=322528 kB
2023-08-29 01:51:08.027 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:07.472 UTC [432307] LOG:  server process (PID 440110) was
terminated by signal 7: Bus error
2023-08-29 01:52:07.472 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:07.472 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:07.500 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:07.689 UTC [440184] LOG:  database system was interrupted;
last known up at 2023-08-29 01:51:07 UTC
2023-08-29 01:52:07.797 UTC [440184] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:07.803 UTC [440184] LOG:  redo starts at F/508640C0
2023-08-29 01:52:07.892 UTC [440184] LOG:  unexpected pageaddr E/C8C98000 in
log segment 000000010000000F00000051, offset 13205504
2023-08-29 01:52:07.892 UTC [440184] LOG:  redo done at F/51C97F28 system
usage: CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.08 s

2023-08-29 01:52:07.908 UTC [440185] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:08.085 UTC [440185] LOG:  checkpoint complete: wrote 1216
buffers (0.1%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.140 s,
sync=0.011 s, total=0.181 s; sync files=39, longest=0.005 s, average=0.001
s; distance=20688 kB, estimate=20688 kB
2023-08-29 01:52:08.104 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:08.153 UTC [440187] FATAL:  the database system is not yet
accepting connections
2023-08-29 01:52:08.153 UTC [440187] DETAIL:  Consistent recovery state has
not been yet reached.
2023-08-29 01:52:09.716 UTC [432307] LOG:  server process (PID 440192) was
terminated by signal 11: Segmentation fault
2023-08-29 01:52:09.716 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:09.716 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:09.720 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:09.873 UTC [440196] LOG:  database system was interrupted;
last known up at 2023-08-29 01:52:08 UTC
2023-08-29 01:52:09.984 UTC [440196] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:09.998 UTC [440196] LOG:  redo starts at F/51C980C0
2023-08-29 01:52:10.001 UTC [440196] LOG:  unexpected pageaddr E/C8CEA000 in
log segment 000000010000000F00000051, offset 13541376
2023-08-29 01:52:10.001 UTC [440196] LOG:  redo done at F/51CE9AB0 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:10.024 UTC [440197] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:10.132 UTC [440197] LOG:  checkpoint complete: wrote 58
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.077 s,
sync=0.014 s, total=0.113 s; sync files=26, longest=0.005 s, average=0.001
s; distance=328 kB, estimate=328 kB
2023-08-29 01:52:10.141 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:52:11.501 UTC [432307] LOG:  server process (PID 440272) was
terminated by signal 11: Segmentation fault
2023-08-29 01:52:11.501 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
2023-08-29 01:52:11.501 UTC [432307] LOG:  terminating any other active
server processes
2023-08-29 01:52:11.504 UTC [432307] LOG:  all server processes terminated;
reinitializing
2023-08-29 01:52:11.648 UTC [440279] LOG:  database system was interrupted;
last known up at 2023-08-29 01:52:10 UTC
2023-08-29 01:52:11.748 UTC [440279] LOG:  database system was not properly
shut down; automatic recovery in progress
2023-08-29 01:52:11.754 UTC [440279] LOG:  redo starts at F/51CEA0C0
2023-08-29 01:52:11.759 UTC [440279] LOG:  unexpected pageaddr E/C8D86000 in
log segment 000000010000000F00000051, offset 14180352
2023-08-29 01:52:11.759 UTC [440279] LOG:  redo done at F/51D85F90 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-08-29 01:52:11.778 UTC [440280] LOG:  checkpoint starting:
end-of-recovery immediate wait
2023-08-29 01:52:11.874 UTC [440280] LOG:  checkpoint complete: wrote 89
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.062 s,
sync=0.016 s, total=0.103 s; sync files=31, longest=0.008 s, average=0.001
s; distance=624 kB, estimate=624 kB
2023-08-29 01:52:11.883 UTC [432307] LOG:  database system is ready to
accept connections
2023-08-29 01:53:16.516 UTC [432307] LOG:  server process (PID 440285) was
terminated by signal 11: Segmentation fault
2023-08-29 01:53:16.516 UTC [432307] DETAIL:  Failed process was running:
select * from process_route_compliance($1, $2, $3)
