Thread: Segmentation Fault PG 14

Segmentation Fault PG 14

From
Willian Colognesi
Date:
Hello!

I started using version `14.5-2.pgdg20.04+2` for a dedicated database and I'm seeing many segmentation faults during the day, when the database is running heavier queries.

The server log contains many entries like this:
```
2022-11-07 17:23:19.423 UTC [728] LOG:  background worker "parallel worker" (PID 9558) was terminated by signal 11: Segmentation fault
2022-11-07 17:23:19.423 UTC [728] DETAIL:  Failed process was running: select blablabla from heavyquery where ...;
2022-11-07 17:23:19.423 UTC [728] LOG:  terminating any other active server processes
2022-11-07 17:23:19.681 UTC [9588] microservice@microservice FATAL:  the database system is in recovery mode
2022-11-07 17:23:19.683 UTC [9589] microservice@microservice FATAL:  the database system is in recovery mode
2022-11-07 17:23:24.543 UTC [728] LOG:  all server processes terminated; reinitializing
2022-11-07 17:23:24.894 UTC [9622] LOG:  database system was interrupted; last known up at 2022-11-07 17:22:07 UTC
2022-11-07 17:23:25.636 UTC [9622] LOG:  invalid record length at 134/227A3A68: wanted 24, got 0
2022-11-07 17:23:25.636 UTC [9622] LOG:  redo done at 134/227A3A38 system usage: CPU: user: 0.04 s, system: 0.06 s, elapsed: 0.70 s
2022-11-07 17:23:27.608 UTC [728] LOG:  database system is ready to accept connections
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] LOG:  could not receive data from client: Connection reset by peer
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] STATEMENT:  START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] LOG:  unexpected EOF on standby connection
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] STATEMENT:  START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] LOG:  could not receive data from client: Connection reset by peer
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] STATEMENT:  START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] LOG:  unexpected EOF on standby connection
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] STATEMENT:  START_REPLICATION 134/22000000 TIMELINE 1
INFO: 2022/11/07 17:23:51.445710 FILE PATH: 000000010000013400000022.lz4
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] LOG:  could not receive data from client: Connection reset by peer
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] STATEMENT:  START_REPLICATION 134/23000000 TIMELINE 1
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] LOG:  unexpected EOF on standby connection
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] STATEMENT:  START_REPLICATION 134/23000000 TIMELINE 1
INFO: 2022/11/07 17:24:27.527897 FILE PATH: 000000010000013400000023.lz4
INFO: 2022/11/07 17:24:38.076058 FILE PATH: 000000010000013400000024.lz4
```
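To get a feel for how often the crashes occur, the log can be scanned for the signal 11 message. The following is a minimal sketch, not something from the original report: the sample lines are copied from the log above into a temporary file, and on a real server `LOG` would instead point at the actual log file, whose location is an assumption that depends on the installation.

```shell
# Hypothetical excerpt of the server log; in practice set LOG to the real file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022-11-07 17:23:19.423 UTC [728] LOG:  background worker "parallel worker" (PID 9558) was terminated by signal 11: Segmentation fault
2022-11-07 17:23:19.423 UTC [728] LOG:  terminating any other active server processes
2022-11-07 17:23:27.608 UTC [728] LOG:  database system is ready to accept connections
EOF

# Count the crash events.
crashes=$(grep -c 'terminated by signal 11' "$LOG")
echo "segfaults: $crashes"

# Group the crashes by date and hour (the first 13 characters of each line).
grep 'terminated by signal 11' "$LOG" | cut -c1-13 | sort | uniq -c

rm -f "$LOG"
```

Grouping by hour makes it easier to line the crashes up with periods of heavy query load.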

The server is running Ubuntu 20.04 on aarch64 (ARM architecture).

I could also get a little information from gdb, I'm not sure if it will help:
```
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/14/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/d7/87a0cf1bb645b349f7c137e36cc30f7ba8805f.debug...
[New LWP 9559]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 14/main: parallel worker for PID 9528                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000100000c757c9c in ?? ()
(gdb) bt
#0  0x000100000c757c9c in ?? ()
#1  0x0000ffff0c757124 in ?? ()
#2  0x0000aaaac2ac9970 in ExecProcNode (node=0xaaaafc599818) at ./build/../src/include/executor/executor.h:257
#3  ExecAppend (pstate=0xaaaafc595918) at ./build/../src/backend/executor/nodeAppend.c:360
#4  0x0000aaaac2ac9970 in ExecProcNode (node=0xaaaafc595918) at ./build/../src/include/executor/executor.h:257
#5  ExecAppend (pstate=0xaaaafc526988) at ./build/../src/backend/executor/nodeAppend.c:360
#6  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
``` 

Has anyone already faced this problem or may know a solution?

Thanks in advance.

--

Willian Cezar de O. Colognesi
Systems Analysis Specialist, Trimble Transportation Brazil
Avenida Santos Dumont, 271 | Londrina, PR | 86039-090

Re: Segmentation Fault PG 14

From
Tom Lane
Date:
Willian Colognesi <willian_colognesi@trimble.com> writes:
> I started to use version `14.5-2.pgdg20.04+2` for a dedicated database and
> I'm facing many segmentation faults during the day when the database has
> more heavy queries.

I take it things were okay with the version you used previously?
What was that exactly?  Has anything else changed?

> I could also get a little information from gdb, I'm not sure if it will
> help:

This looks pretty messed up.  Are you sure the debug symbols you're using
match the package?

Even better, can you construct a self-contained test case?

            regards, tom lane



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
Hi Tom,

`I take it things were okay with the version you used previously?`
Yes, it was working well on another instance running PG version `12.4-1.pgdg18.04+1`. We had to migrate one database from that server to a new one using logical replication.

the process was basically this:
CREATE PUBLICATION my_database_pub FOR ALL TABLES;
postgres@origin:~$ psql "dbname=<my_database> replication=database"
my_database=# CREATE_REPLICATION_SLOT <slot_name> LOGICAL pgoutput;
pg_dump -j4 -h <host> -p 5432 --no-subscriptions --no-publications -d <my_database> --snapshot=<snapshot_generated> -Fd -U <my_user> -f </mnt/dump>
postgres@destination:/mnt/database$ pg_restore -d <my_database> -j 5 </mnt/dump>

CREATE SUBSCRIPTION <name_sub>    
       CONNECTION 'host=<host> dbname=<my_database> user=replica password=?? port=5432'
       PUBLICATION <name_pub>
       WITH (slot_name=<slot_name>, create_slot=false, copy_data=false);

After this migration we started to have this kind of problem on both the replica and primary servers.

`This looks pretty messed up.  Are you sure the debug symbols you're using match the package?`
What exactly do you mean? I'm not very familiar with these debugging tools. The packages I used were:

postgresql-14/focal-pgdg,now 14.5-2.pgdg20.04+2 arm64 [installed]
postgresql-14-dbgsym/focal-pgdg,now 14.5-2.pgdg20.04+2 arm64 [installed]

`Even better, can you construct a self-contained test case?`
I couldn't reproduce the problem: it happens only in the production database, and there doesn't seem to be a pattern to when it occurs.

Is there anything I could provide to help the analysis?




Re: Segmentation Fault PG 14

From
Adrian Klaver
Date:
On 11/7/22 10:36 AM, Willian Colognesi wrote:
> Hi Tom,
> 
> `I take it things were okay with the version you used previously?`
> Yes, it was working pretty well in another instance with pg version 
> `12.4-1.pgdg18.04+1`, and we had to make a migration of one database 
> that was running in this server to another using Logical Replication.

Actually you used dump/restore and logical replication.

Regarding the steps below:

1) What versions of pg_dump and pg_restore did you use?

2) To be clear, the subscription was started after the restore?

3) Were there any error messages issued at any point in the steps below?

4) Are the database clusters on the same machine?

> 
> the process was basically this:
> CREATE PUBLICATION my_database_pub FOR ALL TABLES;
> postgres@origin:~$ psql "dbname=<my_database> replication=database"
> my_database=# CREATE_REPLICATION_SLOT <slot_name> LOGICAL pgoutput;
> pg_dump -j4 -h <host> -p 5432 --no-subscriptions --no-publications -d <my_database> --snapshot=<snapshot_generated> -Fd -U <my_user> -f </mnt/dump>
> postgres@destination:/mnt/database$ pg_restore -d <my_database> -j 5 </mnt/dump>
>
> CREATE SUBSCRIPTION <name_sub>
>         CONNECTION 'host=<host> dbname=<my_database> user=replica password=?? port=5432'
>         PUBLICATION <name_pub>
>         WITH (slot_name=<slot_name>, create_slot=false, copy_data=false);

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
1) What versions of pg_dump and pg_restore did you use?
A: pg_dump and pg_restore were run using PG 14 (the same version the destination was running).

2) To be clear the subscription was started after the restore?
A: Yes

3) Were there any error messages issued at any point in below?
A: No errors during the dump and restore.

4) Are the database clusters on the same machine?
A: No, the origin and destination were different servers in the same VPC.


Re: Segmentation Fault PG 14

From
Adrian Klaver
Date:
On 11/7/22 10:57 AM, Willian Colognesi wrote:
> 1) What versions of pg_dump and pg_restore did you use?
> A: pg_dump and pg_restore were run using PG 14 (the same version the
> destination was running)
>
> 2) To be clear the subscription was started after the restore?
> A: Yes
>
> 3) Were there any error messages issued at any point in below?
> A: No errors during the dump and restore.
>
> 4) Are the database clusters on the same machine?
> A: No, the origin and destination were different servers in the same VPC.

Are the servers using the same OS version?


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
No. The origin server was running Ubuntu 18.04.5 x86_64, and the destination is running Ubuntu 20.04.5 aarch64.


Re: Segmentation Fault PG 14

From
Adrian Klaver
Date:
On 11/7/22 11:03 AM, Willian Colognesi wrote:
> No, the origin where the database was was running ubuntu 18.04.5 x86_64 
> and the destination ubuntu 20.04.5 aarch64

Where I was going was this:

https://wiki.postgresql.org/wiki/Locale_data_changes

Then I realized you had not done any binary upgrades, so that is a dead end.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Segmentation Fault PG 14

From
Tom Lane
Date:
Willian Colognesi <willian_colognesi@trimble.com> writes:
> `I take it things were okay with the version you used previously?`

> Yes, it was working pretty well in another instance with pg version
> `12.4-1.pgdg18.04+1`, and we had to make a migration of one database that
> was running in this server to another using Logical Replication.

12.4 to 14.5 is kind of a big jump :-(.

The stack trace seems to indicate that ExecProcNode transferred control
to never-never land, which says that something clobbered the function
pointer it's trying to indirect through.  I don't recall having seen
any similar reports though.

Are you using any extensions besides those that come with core Postgres?
A build incompatibility with some third-party extension might explain
this, perhaps.

One thing I'm curious about is that the stack trace seems to imply that
there was an Append plan node immediately below another Append.  That
shouldn't happen AFAIK --- the planner tries to collapse out such
cases.  Can you get us an EXPLAIN for the problem query?

            regards, tom lane



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
All the extensions installed in this database are these:
```
                                     List of installed extensions
        Name        | Version |   Schema   |                        Description                        
--------------------+---------+------------+-----------------------------------------------------------
 amcheck            | 1.3     | public     | functions for verifying relation integrity
 btree_gist         | 1.6     | public     | support for indexing common datatypes in GiST
 pg_stat_statements | 1.9     | public     | track execution statistics of all SQL statements executed
 pgcrypto           | 1.3     | public     | cryptographic functions
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
(5 rows)
```

I ran the query with parameters it would typically be run with (I'm not sure exactly which values in the WHERE clause triggered the segmentation fault).

Here is the EXPLAIN: https://explain.depesz.com/s/Tql3 (P.S.: I had to suppress the real table/index names.)

It looks like since I disabled jit, as Boris suggested, the database has not restarted again... (not sure if it's a coincidence)



Re: Segmentation Fault PG 14

From
Adrian Klaver
Date:
On 11/7/22 12:15, Willian Colognesi wrote:
> It looks like since I disabled *jit*, as Boris suggested, the
> database has not restarted again... (not sure if it's a coincidence)
> 

I did not see that post or suggestion.

What was the suggestion?

Are you saying the database does not start up now?

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
No, the database is running well; there have been no problems since I disabled jit.

I just realized that he sent an email directly to me. The message was:
```
I had similar problems with and the cure was to turn off jit in Postgres.conf

jit = off
--
Boris
```
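For readers who want to apply the same workaround without editing the configuration file by hand, the setting can also be changed from SQL. This is a minimal sketch, assuming a superuser can connect with `psql` using default connection settings (an assumption; add host, user, and database options as needed):

```shell
# Persist jit = off in postgresql.auto.conf (requires superuser).
psql -c "ALTER SYSTEM SET jit = off;"

# Ask the server to reload its configuration; jit does not need a restart.
psql -c "SELECT pg_reload_conf();"

# Check the value seen by a new session.
psql -c "SHOW jit;"
```

`jit` can also be turned off for a single session with `SET jit = off;`, which may help when testing whether one particular query is affected.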




Re: Segmentation Fault PG 14

From
Tom Lane
Date:
Willian Colognesi <willian_colognesi@trimble.com> writes:
> No, the database is running well, no problem until now after disabled *jit.*

Interesting.  Which version of LLVM is installed?

            regards, tom lane



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
Do you mean how it was compiled? Here is the output of pg_config:
```
root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
 '--build=aarch64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=${prefix}/lib/aarch64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--mandir=/usr/share/postgresql/14/man' '--docdir=/usr/share/doc/postgresql-doc-14' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/14' '--bindir=/usr/lib/postgresql/14/bin' '--libdir=/usr/lib/aarch64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu 14.5-2.pgdg20.04+2)' '--enable-nls' '--enable-thread-safety' '--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-gssapi' '--with-ldap' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'AWK=mawk' 'MKDIR_P=/bin/mkdir -p' 'PROVE=/usr/bin/prove' 'PYTHON=/usr/bin/python3' 'TAR=/bin/tar' 'XSLTPROC=xsltproc --nonet' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' '--enable-tap-tests' '--with-icu' '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' 'CLANG=/usr/bin/clang-10' '--with-lz4' '--with-systemd' '--with-selinux' 'build_alias=aarch64-linux-gnu' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'
```

There is no LLVM installed on the Ubuntu server; PostgreSQL was installed from the apt package with `apt install postgresql-14`.


Re: Segmentation Fault PG 14

From
Tom Lane
Date:
Willian Colognesi <willian_colognesi@trimble.com> writes:
> There is no llvm installed on ubuntu server, postgresql was installed via
> apt package `apt install postgresql-14`

If there's no LLVM around, then disabling JIT wouldn't do anything,
because it depends on LLVM to compile code.

We should perhaps wait awhile to see if that really fixed it.

            regards, tom lane



Re: Segmentation Fault PG 14

From
Jeffrey Walton
Date:
On Mon, Nov 7, 2022 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Willian Colognesi <willian_colognesi@trimble.com> writes:
> > `I take it things were okay with the version you used previously?`
>
> > Yes, it was working pretty well in another instance with pg version
> > `12.4-1.pgdg18.04+1`, and we had to make a migration of one database that
> > was running in this server to another using Logical Replication.
>
> 12.4 to 14.5 is kind of a big jump :-(.
>
> The stack trace seems to indicate that ExecProcNode transferred control
> to never-never land, which says that something clobbered the function
> pointer it's trying to indirect through.  I don't recall having seen
> any similar reports though.

I'm just thinking out loud... I've seen the latest GCC do that on what
it believes to be dead code. Our problem was detailed at
https://github.com/weidai11/cryptopp/issues/1141 .

We identified the problem by building/running our self tests with
-fsanitize=unreachable .

Testing with -fsanitize=unreachable should confirm or rule out GCC and
Clang [incorrectly] removing code that is actually needed. If this is
the problem, then -fsanitize=unreachable will also provide a usable
stack trace and provide a useful debugging experience.

Jeff



Re: Segmentation Fault PG 14

From
Thomas Munro
Date:
On Tue, Nov 8, 2022 at 11:45 AM Willian Colognesi
<willian_colognesi@trimble.com> wrote:
> root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
> ... --with-extra-version= (Ubuntu 14.5-2.pgdg20.04+2)' ...
> ... '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' ...

> There is no llvm installed on ubuntu server, postgresql was installed via apt package `apt install postgresql-14`

We can see from the pg_config output that it's built with LLVM 10.
Also that looks like it's the usual pgdg packages which are certainly
built against LLVM and will install it automatically.
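The LLVM version can be read out of that output mechanically. This is a minimal sketch, assuming the `LLVM_CONFIG` entry has the form shown above; the `configure_line` variable holds an excerpt of the quoted output, and on a live system it would be filled from `pg_config --configure` instead:

```shell
# Excerpt of the pg_config --configure output quoted in this thread.
configure_line="'--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' 'CLANG=/usr/bin/clang-10'"

# Extract the major version from the llvm-config binary name.
llvm_major=$(printf '%s\n' "$configure_line" \
  | grep -o 'llvm-config-[0-9][0-9]*' \
  | head -n 1 \
  | cut -d- -f3)

echo "PostgreSQL was built against LLVM ${llvm_major}"
```

From SQL, `SELECT pg_jit_available();` reports whether JIT is usable in the current session; it returns false when `jit` is off or no JIT provider library is available.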



Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
You are right Thomas,

Just confirmed and it's installed:

ubuntu@ip-10-x-x-x:~$ apt search llvm | grep inst
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libllvm10/focal,now 1:10.0.0-4ubuntu1 arm64 [installed,automatic]

I had tried something like `llvm -version` without success, but I have now verified through apt that it is installed.

Tom,
Since yesterday the database hasn't restarted, so I believe there is some problem related to jit.


Re: Segmentation Fault PG 14

From
Willian Colognesi
Date:
It looks like we can confirm that disabling jit fixed the problem: since I disabled it yesterday the database has not restarted again, whereas before it was restarting at least once per hour.

I don't think having it disabled will have much impact on our use case. If you need anything else that could help find the bug, feel free to let me know and I can grab the logs or whatever is needed.

Thanks y'all


Re: Segmentation Fault PG 14

From
Tom Lane
Date:
Willian Colognesi <willian_colognesi@trimble.com> writes:
> Looks like we can confirm that the jit disable fixed the problem, because
> since yesterday when I disabled jit, the database did not restarted again,
> and before it the database was restarting at least once per hour.

Hmm.  I now recall that we had a previous report of problems with
JIT on aarch64/Focal:

https://www.postgresql.org/message-id/flat/20220303150428.GA26036%40depesz.com

That was LLVM 9 not LLVM 10, but since we never identified the exact
issue, there's no real strong reason to suppose it's been fixed.

Probably keeping JIT off is the best answer for you --- it's hard to
say when we'll be able to make progress with this, given the lack of
reproducible test cases.

            regards, tom lane