Re: A few patches to clarify snapshot management - Mailing list pgsql-hackers

From Noah Misch
Subject Re: A few patches to clarify snapshot management
Date
Msg-id 20250809222338.cc.nmisch@google.com
Whole thread Raw
In response to Re: A few patches to clarify snapshot management  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Tue, Jan 07, 2025 at 11:55:00AM +0200, Heikki Linnakangas wrote:
> On 07/01/2025 00:00, Andres Freund wrote:
> > On 2024-12-20 19:31:01 +0200, Heikki Linnakangas wrote:
> > > While playing around some more with this, I noticed that this code in
> > > GetTransactionSnapshot() is never reached, and AFAICS has always been dead
> > > code:
> > > 
> > > > Snapshot
> > > > GetTransactionSnapshot(void)
> > > > {
> > > >     /*
> > > >      * Return historic snapshot if doing logical decoding. We'll never need a
> > > >      * non-historic transaction snapshot in this (sub-)transaction, so there's
> > > >      * no need to be careful to set one up for later calls to
> > > >      * GetTransactionSnapshot().
> > > >      */
> > > >     if (HistoricSnapshotActive())
> > > >     {
> > > >         Assert(!FirstSnapshotSet);
> > > >         return HistoricSnapshot;
> > > >     }
> > > 
> > > when you think about it, that's good, because it doesn't really make sense
> > > to call GetTransactionSnapshot() during logical decoding. We jump through
> > > hoops to make the historic catalog decoding possible with historic
> > > snapshots, tracking subtransactions that modify catalogs and WAL-logging
> > > command ids, but they're not suitable for general purpose queries. So I
> > > think we should turn that into an error, per attached patch.
> > 
> > Hm. I'm not sure it's a good idea to forbid this. Couldn't there be sane C
> > code in an output functions calling GetTransactionSnapshot() or such to do
> > some internal lookups?
> 
> I haven't seen any. And I don't think that would work correctly while doing
> logical decoding anyway, because historical snapshots only track XIDs that
> modify catalogs. regclassout and enumout do work because they use the
> catalog snapshot rather than GetTransactionSnapshot().
> 
> (I committed that change in commit 1585ff7387 already, but discussion is
> still welcome of course)

https://github.com/2ndQuadrant/pglogical does rely on the pre-1585ff7387 code
for its row_filter feature.  row_filter calls ExecEvalExpr() from the output
plugin, e.g. to evaluate expression "id between 2 AND 4" arising from this
configuration in the pglogical test suite:

  SELECT * FROM pglogical.replication_set_add_table('default', 'basic_dml', false, row_filter := $rf$id between 2 AND
4$rf$);

One of the GetTransactionSnapshot() calls is in pglogical_output_plugin.c
itself.  For that, I could work around the change by forcing the old
HistoricSnapshot use:

-        PushActiveSnapshot(GetTransactionSnapshot());
+        Assert(HistoricSnapshotActive());
+        PushActiveSnapshot(GetCatalogSnapshot(InvalidOid));

That doesn't get far.  Calling a plpgsql function in the expression reaches
the "cannot take query snapshot during logical decoding" error via this stack
trace:

GetTransactionSnapshot at snapmgr.c:279:3
exec_eval_simple_expr at pl_exec.c:6214:3
 (inlined by) exec_eval_expr at pl_exec.c:5699:6
exec_stmt_raise at pl_exec.c:3820:8
 (inlined by) exec_stmts at pl_exec.c:2096:10
exec_stmt_block at pl_exec.c:1955:6
exec_toplevel_block at pl_exec.c:1646:7
plpgsql_exec_function at pl_exec.c:636:5
plpgsql_call_handler at pl_handler.c:278:11
fmgr_security_definer at fmgr.c:755:52
ExecInterpExpr at execExprInterp.c:927:7
pglogical_change_filter at pglogical_output_plugin.c:663:7
 (inlined by) pg_decode_change at pglogical_output_plugin.c:691:7
change_cb_wrapper at logical.c:1121:22
ReorderBufferApplyChange at reorderbuffer.c:2078:3
 (inlined by) ReorderBufferProcessTXN at reorderbuffer.c:2383:7
DecodeCommit at decode.c:743:3
 (inlined by) xact_decode at decode.c:242:5
LogicalDecodingProcessRecord at decode.c:123:1
XLogSendLogical at walsender.c:3442:33
WalSndLoop at walsender.c:2837:7
StartLogicalReplication at walsender.c:1504:2
 (inlined by) exec_replication_command at walsender.c:2158:6
PostgresMain at postgres.c:4762:10
BackendMain at backend_startup.c:80:2
postmaster_child_launch at launch_backend.c:291:3
BackendStartup at postmaster.c:3587:8
 (inlined by) ServerLoop at postmaster.c:1702:6
PostmasterMain at postmaster.c:1252:6
main at main.c:165:4

> > Hm. I'm not sure it's a good idea to forbid this. Couldn't there be sane C
> > code in an output functions calling GetTransactionSnapshot() or such to do
> > some internal lookups?

I think pglogical_output_plugin.c w/ plpgsql is largely sane when used with a
plpgsql function that consults only catalogs and the output tuple.  If the
pglogical test suite is representative, that's the usual case for a
row_filter.  A plpgsql function that reads user tables will be fragile with
concurrent pruning, but a user might sanely accept that fragility.  A plpgsql
function that writes tuples is not sane in a row_filter.  How do you see it?

So far, I know of these options:

1. Make pglogical block the row_filter feature for any v18+ origin.
2. Revert postgresql.git commit 1585ff7387.
3. Make pglogical use HistoricSnapshot where pglogical_output_plugin.c handles
   snapshots directly.  That should keep simple row_filter expressions like
   "col > 0" functioning.  Entering plpgsql or similarly-complex logic will
   fail with "cannot take query snapshot during logical decoding", and we'll
   consider that to be working as intended.
4. Fail later and lazily, for just the most-unreasonable cases.  For example,
   fail when HistoricSnapshot applies to a write operation.  (Maybe this
   already fails.  I didn't check.)

Which of those or other options should we consider?


For reference, https://github.com/2ndQuadrant/pglogical/pull/503 is an
otherwise-working port of pglogical to v18.  Its Makefile currently disables
the tests that reach "cannot take query snapshot during logical decoding".



pgsql-hackers by date:

Previous
From: Aleksander Alekseev
Date:
Subject: Re: VM corruption on standby
Next
From: Kirill Reshke
Date:
Subject: Re: VM corruption on standby