Thread: autoovacuum launcher process

autoovacuum launcher process

From
liam saffioti
Date:
Hello,

I have 3 nodes patroni cluster with PG12.6 on RHEL 7.9.
I encountered an error indicating that the database has entered recovery mode. I didn't understand why this happened.
There are no similar error messages before the error, such INFO-ERROR-WARNING-FATAL so on.
Moreover, I run vacuumdb in crontab everyday. (vacuumdb -a -j12 -z --skip-locked)

>>>  Do you think vacuumdb could be the case that this is causing such an error?
>>>  What else could cause this autovacuum launcher process to exit?



2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
2021-09-10 01:00:33.356     110257  LOG:  terminating any other active server processes
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle WARNING:  terminating connection because of crash of another server process
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally an
d possibly corrupted shared memory.
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-09-10 01:00:33.448 10.26.105.14 aioc aioc pgmoon 144895 idle WARNING:  terminating connection because of crash of another server process
2021-09-10 01:00:33.448 10.26.105.14 aioc aioc pgmoon 144895 idle DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally an
d possibly corrupted shared memory.
...... (similar lots of log)

Thanks a lot.
Liam

Re: autoovacuum launcher process

From
Tom Lane
Date:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

            regards, tom lane



Re: autoovacuum launcher process

From
liam saffioti
Date:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane

Re: autoovacuum launcher process

From
Avinash Vallarapu
Date:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane


--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc. 

Re: autoovacuum launcher process

From
liam saffioti
Date:
Ok, my extensions list is;
 hypopg            
 pg_buffercache    
 pg_partman        
 pg_stat_kcache    
 pg_stat_statements
timescaledb
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.


Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response


Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane


--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc. 

Re: autoovacuum launcher process

From
Avinash Vallarapu
Date:
Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:
Ok, my extensions list is;
 hypopg            
 pg_buffercache    
 pg_partman        
 pg_stat_kcache    
 pg_stat_statements
timescaledb
Reason I asked this question is,  I see the pg_class specific error messages you see in logs with timescaledb. 
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.


Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response


Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane


--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc. 

Re: autoovacuum launcher process

From
liam saffioti
Date:
Hi Vanish,
Thank you for your reply,
Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
...

2021-09-10 01:02:27.810     146444  LOG:  TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810     146444  HINT:  Consider increasing timescaledb.max_background_workers.

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 14:39 tarihinde şunu yazdı:
Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:
Ok, my extensions list is;
 hypopg            
 pg_buffercache    
 pg_partman        
 pg_stat_kcache    
 pg_stat_statements
timescaledb
Reason I asked this question is,  I see the pg_class specific error messages you see in logs with timescaledb. 
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.


Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response


Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane


--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc. 

Re: autoovacuum launcher process

From
liam saffioti
Date:
Sorry for the mistake,
Hi Avinash,
Thank you for your reply,
Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
...

2021-09-10 01:02:27.810     146444  LOG:  TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810     146444  HINT:  Consider increasing timescaledb.max_background_workers.

liam saffioti <liam.saffiotti@gmail.com>, 13 Eyl 2021 Pzt, 16:27 tarihinde şunu yazdı:
Hi Vanish,
Thank you for your reply,
Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
...

2021-09-10 01:02:27.810     146444  LOG:  TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810     146444  HINT:  Consider increasing timescaledb.max_background_workers.

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 14:39 tarihinde şunu yazdı:
Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:
Ok, my extensions list is;
 hypopg            
 pg_buffercache    
 pg_partman        
 pg_stat_kcache    
 pg_stat_statements
timescaledb
Reason I asked this question is,  I see the pg_class specific error messages you see in logs with timescaledb. 
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.


Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response


Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes. 

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

                        regards, tom lane


--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc. 

Re: autoovacuum launcher process

From
Tom Lane
Date:
liam saffioti <liam.saffiotti@gmail.com> writes:
> Do you think timescaledb is causing this problem?

Seems possible.

> 2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without
> having selected a database
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

This is pretty clear evidence that some code running inside the
autovacuum launcher process tried to read a non-shared catalog, which it
cannot do because the launcher isn't connected to any specific database
within the cluster.  It is highly unlikely that anything in core Postgres
is doing that; we'd have noticed.  So I conclude that some extension code
is carelessly trying to do catalog accesses without checking to see if
it's running in a background process that doesn't support that.

Of the extensions you listed, timescaledb is perhaps the most likely to
contain such a bug; but I don't know much about hypopg, pg_partman,
pg_stat_kcache, or powa, so I can't completely rule those out.  It would
have to be an extension that sometimes acts of its own accord without
being called by a SQL command, so the list of suspects isn't all that
long.

Anyway, your first step should be to make sure you have up-to-date
versions of all those extensions.  If you do, then try filing a bug
with timescaledb.

            regards, tom lane



Re: autoovacuum launcher process

From
liam saffioti
Date:
Thank you a lot for your detailed explanation and guidance.
sincerely yours,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 13 Eyl 2021 Pzt, 16:46 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> Do you think timescaledb is causing this problem?

Seems possible.

> 2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without
> having selected a database
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

This is pretty clear evidence that some code running inside the
autovacuum launcher process tried to read a non-shared catalog, which it
cannot do because the launcher isn't connected to any specific database
within the cluster.  It is highly unlikely that anything in core Postgres
is doing that; we'd have noticed.  So I conclude that some extension code
is carelessly trying to do catalog accesses without checking to see if
it's running in a background process that doesn't support that.

Of the extensions you listed, timescaledb is perhaps the most likely to
contain such a bug; but I don't know much about hypopg, pg_partman,
pg_stat_kcache, or powa, so I can't completely rule those out.  It would
have to be an extension that sometimes acts of its own accord without
being called by a SQL command, so the list of suspects isn't all that
long.

Anyway, your first step should be to make sure you have up-to-date
versions of all those extensions.  If you do, then try filing a bug
with timescaledb.

                        regards, tom lane