On Sat, Jun 05, 2021 at 05:25:39PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
> >> The issue happening each morning when application starts on the production DataBase during about a month.
> >> Always the same transaction id is mentioned in the error (1954017648)
> >> We tried to do UNLISTEN - no changes. the same issue.
> >> LISTEN works good for any other channels.
>
> > ...
> > On the other hand, if that's what happened, your report that "LISTEN works
> > good for any other channels" surprises me. Perhaps something completely
> > different happened on your system.
>
> I suspect the true state of affairs was more like "LISTEN works good in
> any other databases". The described symptoms are consistent with there
> being a message in the NOTIFY queue that has a pruned-away xid. An
> incoming listener would try to scan over already-committed notify
> messages, but testing to see whether this stale message is committed
> would fail. However, since asyncQueueProcessPageEntries just ignores
> messages not targeted for the current database, incoming listeners
> in other databases wouldn't notice the problem.
>
> ISTM the interesting question here is what was holding back truncation
> of the NOTIFY queue. Could there have been an open transaction somewhere
> that was failing to collect NOTIFY data?
Open transactions always decline to collect notify data, don't they? See
ProcessNotifyInterrupt().
> But a transaction sitting open
> for a month is likely to cause far more severe problems than that one.
True.
> > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > xid in the notification queue; it would set frozenXID to that value if
> > frozenXID would otherwise be older.
>
> Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> queue (where "oldest" is defined as "numerically smallest"). The entries
> are in xid commit order which is a different thing.
Yeah, it wouldn't be cheap in the general case. The value could be a field in
pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
run roughly as often as autovacuum visits template0.