Thread: autoovacuum launcher process

autoovacuum launcher process

From

liam saffioti

Date:

12 September 2021, 10:03:23

Hello,

I have 3 nodes patroni cluster with PG12.6 on RHEL 7.9.
I encountered an error indicating that the database has entered recovery mode. I didn't understand why this happened.
There are no similar error messages before the error, such INFO-ERROR-WARNING-FATAL so on.
Moreover, I run vacuumdb in crontab everyday. (vacuumdb -a -j12 -z --skip-locked)

>>> Do you think vacuumdb could be the case that this is causing such an error?
>>> What else could cause this autovacuum launcher process to exit?

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
2021-09-10 01:00:33.356 110257 LOG: terminating any other active server processes
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle WARNING: terminating connection because of crash of another server process
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally an
d possibly corrupted shared memory.
2021-09-10 01:00:33.447 10.26.105.11 aioc aioc pgmoon 144929 idle HINT: In a moment you should be able to reconnect to the database and repeat your command.
2021-09-10 01:00:33.448 10.26.105.14 aioc aioc pgmoon 144895 idle WARNING: terminating connection because of crash of another server process
2021-09-10 01:00:33.448 10.26.105.14 aioc aioc pgmoon 144895 idle DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally an
d possibly corrupted shared memory.

...... (similar lots of log)

Thanks a lot.

Liam

Re: autoovacuum launcher process

From

Tom Lane

Date:

12 September 2021, 14:12:23

liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one.  You sure PID
156668 didn't log anything before dying?

            regards, tom lane

Re: autoovacuum launcher process

From

liam saffioti

Date:

13 September 2021, 03:52:26

Hello again,

I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1

sincerely,

Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:

liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

Re: autoovacuum launcher process

From

Avinash Vallarapu

Date:

13 September 2021, 04:11:48

Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:

Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database

Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1

It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

Regards,
Avinash Vallarapu

CEO

MigOps, Inc.

Re: autoovacuum launcher process

From

liam saffioti

Date:

13 September 2021, 05:03:58

Ok, my extensions list is;

hypopg
pg_buffercache
pg_partman
pg_stat_kcache
pg_stat_statements

timescaledb

powa

postgres_fdw

btree_gin

btree_gist

plpythonu

uuid-ossp

plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.

Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:

Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc.

Re: autoovacuum launcher process

From

Avinash Vallarapu

Date:

13 September 2021, 11:39:05

Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:

Ok, my extensions list is;
hypopg
pg_buffercache
pg_partman
pg_stat_kcache
pg_stat_statements
timescaledb

Reason I asked this question is, I see the pg_class specific error messages you see in logs with timescaledb.

powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.

Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc.

Re: autoovacuum launcher process

From

liam saffioti

Date:

13 September 2021, 13:27:52

Hi Vanish,

Thank you for your reply,

Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1

...

2021-09-10 01:02:27.810 146444 LOG: TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810 146444 HINT: Consider increasing timescaledb.max_background_workers.

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 14:39 tarihinde şunu yazdı:

Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:
Ok, my extensions list is;
hypopg
pg_buffercache
pg_partman
pg_stat_kcache
pg_stat_statements
timescaledb
Reason I asked this question is, I see the pg_class specific error messages you see in logs with timescaledb.
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.

Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc.

Re: autoovacuum launcher process

From

liam saffioti

Date:

13 September 2021, 13:30:16

Sorry for the mistake,

Hi Avinash,

Thank you for your reply,

Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1

...

2021-09-10 01:02:27.810 146444 LOG: TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810 146444 HINT: Consider increasing timescaledb.max_background_workers.

liam saffioti <liam.saffiotti@gmail.com>, 13 Eyl 2021 Pzt, 16:27 tarihinde şunu yazdı:

Hi Vanish,
Thank you for your reply,
Do you think timescaledb is causing this problem? I can't see any log about Timescaledb before the problem, but there are many logs after that.

2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
...

2021-09-10 01:02:27.810 146444 LOG: TimescaleDB background worker limit of 8 exceeded
2021-09-10 01:02:27.810 146444 HINT: Consider increasing timescaledb.max_background_workers.

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 14:39 tarihinde şunu yazdı:
Hi,

On Mon., Sep. 13, 2021, 2:04 a.m. liam saffioti, <liam.saffiotti@gmail.com> wrote:
Ok, my extensions list is;
hypopg
pg_buffercache
pg_partman
pg_stat_kcache
pg_stat_statements
timescaledb
Reason I asked this question is, I see the pg_class specific error messages you see in logs with timescaledb.
powa
postgres_fdw
btree_gin
btree_gist
plpythonu
uuid-ossp
plpgsql

>> It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

I looked /var/log/messages, I suspected OOM killer, but there is no any log about it, or "terminate" etc.

Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:41 pg_rhel02 patroni: 2021-09-10 01:00:41,667 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:44 pg_rhel02 patroni: 2021-09-10 01:00:44,708 WARNING: Retry got exception: 'connection problems'
Sep 10 01:00:45 pg_rhel02 journal: Suppressed 89852 messages from /system.slice/auditd.service
Sep 10 01:00:47 pg_rhel02 patroni: localhost:5432 - no response
Sep 10 01:00:47 pg_rhel02 patroni: 2021-09-10 01:00:47,764 INFO: Error communicating with PostgreSQL. Will try again later
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: Lock owner: pg_rhel02; I am pg_rhel02
Sep 10 01:00:51 pg_rhel02 patroni: 2021-09-10 01:00:51,186 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:54 pg_rhel02 patroni: 2021-09-10 01:00:54,832 INFO: establishing a new patroni connection to the postgres cluster
Sep 10 01:00:57 pg_rhel02 patroni: 2021-09-10 01:00:57,837 WARNING: Retry got exception: 'connection problems'
Sep 10 01:01:00 pg_rhel02 patroni: localhost:5432 - no response

Avinash Vallarapu <avinash.vallarapu@gmail.com>, 13 Eyl 2021 Pzt, 07:12 tarihinde şunu yazdı:
Hi Liam,

On Mon, Sep 13, 2021 at 12:52 AM liam saffioti <liam.saffiotti@gmail.com> wrote:
Hello again,
I grepped the db log with PID 156668, the result is below;

# more pg_cluster-Fri.log | grep 156668
2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without having selected a database
Though it sounds like an unrelated question, what is the unique list of extensions you got in all your database(s) ?

\dx

2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID 156668) exited with exit code 1
It would also be wise to look at dmesg or /var/log/messages and see if there is anything interesting around what could have terminated one of the processes.

sincerely,
Liam

Tom Lane <tgl@sss.pgh.pa.us>, 12 Eyl 2021 Paz, 17:12 tarihinde şunu yazdı:
liam saffioti <liam.saffiotti@gmail.com> writes:
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

The interesting log entry(s) would be before this one. You sure PID
156668 didn't log anything before dying?

regards, tom lane

--
Regards,
Avinash Vallarapu
CEO
MigOps, Inc.

Re: autoovacuum launcher process

From

Tom Lane

Date:

13 September 2021, 13:46:14

liam saffioti <liam.saffiotti@gmail.com> writes:
> Do you think timescaledb is causing this problem?

Seems possible.

> 2021-09-10 01:00:32.691     156668  FATAL:  cannot read pg_class without
> having selected a database
> 2021-09-10 01:00:33.356     110257  LOG:  autovacuum launcher process (PID
> 156668) exited with exit code 1

This is pretty clear evidence that some code running inside the
autovacuum launcher process tried to read a non-shared catalog, which it
cannot do because the launcher isn't connected to any specific database
within the cluster.  It is highly unlikely that anything in core Postgres
is doing that; we'd have noticed.  So I conclude that some extension code
is carelessly trying to do catalog accesses without checking to see if
it's running in a background process that doesn't support that.

Of the extensions you listed, timescaledb is perhaps the most likely to
contain such a bug; but I don't know much about hypopg, pg_partman,
pg_stat_kcache, or powa, so I can't completely rule those out.  It would
have to be an extension that sometimes acts of its own accord without
being called by a SQL command, so the list of suspects isn't all that
long.

Anyway, your first step should be to make sure you have up-to-date
versions of all those extensions.  If you do, then try filing a bug
with timescaledb.

            regards, tom lane

Re: autoovacuum launcher process

From

liam saffioti

Date:

13 September 2021, 14:48:09

Thank you a lot for your detailed explanation and guidance.

sincerely yours,

Liam

Tom Lane <tgl@sss.pgh.pa.us>, 13 Eyl 2021 Pzt, 16:46 tarihinde şunu yazdı:

liam saffioti <liam.saffiotti@gmail.com> writes:
> Do you think timescaledb is causing this problem?

Seems possible.

> 2021-09-10 01:00:32.691 156668 FATAL: cannot read pg_class without
> having selected a database
> 2021-09-10 01:00:33.356 110257 LOG: autovacuum launcher process (PID
> 156668) exited with exit code 1

This is pretty clear evidence that some code running inside the
autovacuum launcher process tried to read a non-shared catalog, which it
cannot do because the launcher isn't connected to any specific database
within the cluster. It is highly unlikely that anything in core Postgres
is doing that; we'd have noticed. So I conclude that some extension code
is carelessly trying to do catalog accesses without checking to see if
it's running in a background process that doesn't support that.

Of the extensions you listed, timescaledb is perhaps the most likely to
contain such a bug; but I don't know much about hypopg, pg_partman,
pg_stat_kcache, or powa, so I can't completely rule those out. It would
have to be an extension that sometimes acts of its own accord without
being called by a SQL command, so the list of suspects isn't all that
long.

Anyway, your first step should be to make sure you have up-to-date
versions of all those extensions. If you do, then try filing a bug
with timescaledb.

regards, tom lane