Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers

From Jacques Combrink
Subject Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date
Msg-id db706cba-c86f-4999-8696-fe5545d4e630@quantsolutions.co.za
Whole thread Raw
In response to Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue  (Matheus Alcantara <matheusssilv97@gmail.com>)
Responses Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
List pgsql-hackers
I came across this discussion after we had this problem at least twice 
now at work.
I read through the discussion to see if this is the problem we are 
experiencing.
At first, I thought this was not exactly our problem, because of the way 
it was showed to be reproduced.

With a long running transaction that is listening, and then doing all 
the other steps like like running through transactions until a second 
pg_xact file is created, and then vacuuming etc etc...

But, I felt like we don't do this, so it cannot be the way we run into 
the problem. Specifically, having a listener as a long running transaction.


So I set out to get it into the broken state, and here is how I can 
reproduce it without long running queries:

1. Create another database

postgres=# CREATE DATABASE test;
CREATE DATABASE

2. Create a listener on the postgres database.

postgres=# LISTEN xx;
LISTEN

3. Create notifies for the test database.

Here I struggled to get a stuck notification in the queue (SELECT 
pg_notification_queue_usage();).

It can happen with only a single notify from a  psql connection, but I 
get a higher hit rate with the following:

Create a notify.sql with the only contents being:

```
NOTIFY s, 'yappers';
```
Then run this against the test database with pgbench.

pgbench -n -c 80 -j 20 -t 1000 -f notify.sql test


4. Confirm that there is now something stuck in the queue:

postgres=# SELECT pg_notification_queue_usage();
  pg_notification_queue_usage
-----------------------------
          9.5367431640625e-07
(1 row)

If this still shows 0, then run step 3 again.

5. Consume xid's. I create a file consume.sql with the only contents being:
```
SELECT txid_current();
```

Then:

pgbench -n -c 80 -j 30 -t 100000 -f consume.sql postgres

6. Verify that a new file is created in the pg_xact folder.

If not, just run the previous step again.

7. Run vacuum freeze. (Remember to allow connections to template0 
beforehand with `ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;`)

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"
vacuumdb: vacuuming database "test"

8. Connect to the test database and try execute a listener.
test=# LISTEN anything;
ERROR:  could not access status of transaction 81086
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.
test=#


Extra weirdness:

In step 2 create a connection to the test database and LISTEN there on 
any channel, even the one you are notifying to:
And you don't have to run anything on that connection, there is no 
backend_xmin on that connection, you don't start a transaction nothing, 
you just run `LISTEN something;`
Then after step 7, as long as you don't close this connection, you will 
not get an error when you try to set up a listener on the test database, 
even on a new connection to the test database.

What I'm saying is after step 7 you can have two connections. One to 
postgres database with active listener, and one to test database with 
active listener.
As long as the postgres one is open the queue is stuck, ie `SELECT 
pg_notification_queue_usage();` returns non-zero.
As long as the test database connection is open, new listeners do not 
experience the error.

If you close the test database connection, but leave the postgres 
connection, from now on any listener you try to create on test database 
will error.

If you close the postgres connection, the queue clears immediately and 
new LISTEN's on the test database will work.
This means it is possible to get rid of the problem without restarting 
the postmaster if you can close connections until you close the one that 
"caused" the problem.
Don't know if this is actually useful, but to me it seemed like everyone 
believed it to be impossible to recover without restarting, so it's 
something at least.


TLDR:
active listener on one database causes notify on another database to get 
stuck.
At no point could I get a stuck notify if I don't have a listener on at 
least one other database than the one I am notifying on. See the Extra 
weirdness section.
At no point do you need to have any other queries running, there is 
never an idle in transaction query needed for bad timing with the vacuum.

I hope I explained everything well enough so that one of you smart 
people can find and fix the problem.




pgsql-hackers by date:

Previous
From: Daniil Davydov
Date:
Subject: Re: Replace magic numbers with strategy numbers for B-tree indexes
Next
From: Bertrand Drouvot
Date:
Subject: Get rid of pgstat_count_backend_io_op*() functions