Re: logical replication empty transactions - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: logical replication empty transactions
In response to Re: logical replication empty transactions  (Andres Freund)
List pgsql-hackers
On Tue, 10 Mar 2020 at 02:30, Andres Freund wrote:

On 2020-03-06 13:53:02 +0800, Craig Ringer wrote:
> On Mon, 2 Mar 2020 at 19:26, Amit Kapila wrote:
> > One thing that is not clear to me is how will we advance restart_lsn
> > if we don't send any empty xact in a system where there are many such
> > xacts?
> Same way we already do it for writes that are not replicated over
> logical replication, like vacuum work etc. The upstream sends feedback
> with reply-requested. The downstream replies. The upstream advances
> confirmed_flush_lsn, and that lazily updates restart_lsn.

It'll still delay it a bit.

Right, but we don't generally care because there's no sync rep txn waiting for confirmation. If we lose progress due to a crash it doesn't matter. It does delay removal of old WAL a little, but it hardly matters.

Somewhat independent from the issue at hand: It'd be really good if we
could evolve the syncrep framework to support per-database waiting... It
shouldn't be that hard, and the current situation sucks quite a bit (and
yes, I'm to blame).

Hardly, you just didn't get the chance to fix that on top of the umpteen other things you had to change to make all the logical stuff work. You didn't break it, just didn't implement every single possible enhancement all at once. Shocking, I tell you.

I'm not quite sure what you mean by "poke the walsender"? Kinda sounds
like sending a signal, but decoding happens inside the walsender,
so there's no need for that. Do you just mean somehow requesting that
walsender sends a feedback message?

Right. I had in mind something like sending a ProcSignal via our funky multiplexed signal mechanism to ask the walsender to immediately generate a keepalive message with a reply-requested flag, then set the walsender's latch so we wake it promptly.

To address the volume we could:

1a) Introduce a pgoutput message type to indicate that the LSN has
  advanced, without needing separate BEGIN/COMMIT. Right now BEGIN is
  21 bytes, COMMIT is 26. But we really don't need that much here. A
  single message should do the trick.

It would. Is it worth caring though? Especially since it seems rather unlikely that the actual network data volume of begin/commit msgs will be much of a concern. It's not like we're PITRing logical streams, and if we did, we could just filter out empty commits on the receiver side.

That message pretty much already exists in the form of a walsender keepalive anyway so we might as well re-use that and not upset the protocol.
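For a sense of scale, here is a small standalone sketch that spells out the message sizes and encodes a keepalive body. The sizes follow the field layouts in the streaming replication and logical replication protocol docs and exclude the CopyData framing. The helper names (put_u64_be, encode_keepalive) are made up for illustration and are not walsender functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Message body sizes, not counting the CopyData wrapper around each one. */
enum
{
    BEGIN_MSG_SIZE     = 1 + 8 + 8 + 4,     /* 'B', final LSN, commit time, xid */
    COMMIT_MSG_SIZE    = 1 + 1 + 8 + 8 + 8, /* 'C', flags, commit LSN, end LSN, commit time */
    KEEPALIVE_MSG_SIZE = 1 + 8 + 8 + 1      /* 'k', WAL end, send time, reply-requested */
};

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64_be(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t) (v >> (56 - 8 * i));
}

/*
 * Hypothetical helper: write a keepalive body into buf.
 * send_time is microseconds since 2000-01-01, as the protocol specifies.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t
encode_keepalive(uint8_t *buf, size_t buflen, uint64_t wal_end,
                 int64_t send_time, int reply_requested)
{
    if (buflen < KEEPALIVE_MSG_SIZE)
        return 0;

    buf[0] = 'k';
    put_u64_be(buf + 1, wal_end);
    put_u64_be(buf + 9, (uint64_t) send_time);
    buf[17] = reply_requested ? 1 : 0;
    return KEEPALIVE_MSG_SIZE;
}
```

So one keepalive body is 18 bytes against 47 for an empty BEGIN/COMMIT pair, and it already carries the reply-requested flag.
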
1b) Add a LogicalOutputPluginWriterUpdateProgress parameter (and
  possibly rename) that indicates that we are intentionally "ignoring"
  WAL. For walsender that callback then could check if it could just
  forward the position of the client (if it was entirely caught up
  before), or if it should send a feedback request (if syncrep is
  enabled, or distance is big).

I can see something like that being very useful, because at present only the output plugin knows if a txn is "empty" as far as that particular slot and output plugin is concerned. The reorder buffering mechanism cannot do relation-level filtering before it sends the changes to the output plugin during ReorderBufferCommit, since it only knows about relfilenodes, not relation OIDs. And the output plugin might be doing finer-grained filtering using row-filter expressions or who knows what else.

But as described above, that will only help for txns done in DBs other than the one the logical slot is for, or for txns known to have an empty ReorderBuffer when the commit is seen.

If there's a txn in the slot's db with a non-empty reorderbuffer, the output plugin won't know if the txn is empty or not until it finishes processing all callbacks and sees the commit for the txn. So it will generally have emitted the Begin message on the wire by the time it knows it has nothing useful to say. And Pg won't know that this txn is empty as far as this output plugin with this particular slot, set of output plugin params, and current user-catalog state is concerned, so it won't have any way to call the output plugin's "update progress" callback instead of the usual begin/change/commit callbacks.

But I think we can already skip empty txns unless sync-rep is enabled with no core changes, and send empty txns as walsender keepalives instead, by altering only output plugins, like this:

* Stash the BEGIN data in the plugin's LogicalDecodingContext.output_plugin_private when the plugin's begin callback is called; don't write anything to the outstream.
* Write out the BEGIN message lazily when any other callback generates a message that does need to be written out.
* If no BEGIN has been written by the time the COMMIT callback is called, discard the COMMIT too. Check whether sync rep is enabled. If it is, call LogicalDecodingContext.update_progress from within the output plugin's commit handler, probably by calling OutputPluginUpdateProgress(); otherwise just ignore the commit entirely.
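
The steps above can be sketched as a standalone toy model. This is not real output plugin code: LazyTxnState and the lazy_* functions are hypothetical stand-ins for the plugin's private state and its begin/change/commit callbacks, the "outstream" is only a counter, and progress_updates stands in for a call to OutputPluginUpdateProgress().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-slot state, the kind of thing output_plugin_private holds. */
typedef struct LazyTxnState
{
    bool        begin_pending;      /* BEGIN stashed but not yet written */
    uint32_t    xid;                /* stashed BEGIN data */
    uint64_t    final_lsn;
    int         messages_written;   /* messages put on the simulated outstream */
    const char *last_written;       /* kind of the most recent message */
    int         progress_updates;   /* simulated "update progress" calls */
} LazyTxnState;

/* Simulated write to the outstream: just record that it happened. */
static void
write_message(LazyTxnState *st, const char *kind)
{
    st->messages_written++;
    st->last_written = kind;
}

/* Begin callback: stash the BEGIN data, write nothing. */
void
lazy_begin(LazyTxnState *st, uint32_t xid, uint64_t final_lsn)
{
    st->begin_pending = true;
    st->xid = xid;
    st->final_lsn = final_lsn;
}

/* Change callback: emit the stashed BEGIN first, but only if the change survives filtering. */
void
lazy_change(LazyTxnState *st, bool passes_filter)
{
    if (!passes_filter)
        return;

    if (st->begin_pending)
    {
        write_message(st, "BEGIN");
        st->begin_pending = false;
    }
    write_message(st, "CHANGE");
}

/* Commit callback: drop the COMMIT if BEGIN was never written. */
void
lazy_commit(LazyTxnState *st, bool sync_rep_enabled)
{
    if (st->begin_pending)
    {
        st->begin_pending = false;
        if (sync_rep_enabled)
            st->progress_updates++; /* stand-in for OutputPluginUpdateProgress() */
        return;
    }
    write_message(st, "COMMIT");
}
```

In this model a txn whose changes are all filtered out writes nothing, and only triggers a progress update when sync rep is enabled. A txn with at least one surviving change writes BEGIN, its changes, and COMMIT as usual.
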

  We could e.g. have a new LogicalDecodingContext callback that is
  called whenever WalSndWaitForWal() would wait. That'd check if there's
  a pending "need" to send out an 'empty transaction'/feedback request
  message. The "need" flag would get cleared whenever we send out data
  bearing an LSN for other reasons.

I can see that being handy, yes. But it won't necessarily help with the sync rep issue, since other sync rep txns may continue to generate WAL while others wait for commit-confirmations that won't come from the logical replica.
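
To make the "need" flag idea concrete, here is a minimal standalone model of it. Everything in it is hypothetical: FeedbackNeed and the three functions are illustrative names, not existing walsender or LogicalDecodingContext members, and should_request_feedback() stands for whatever the new callback would check at the point where WalSndWaitForWal() is about to wait.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for skipped (empty) transactions. */
typedef struct FeedbackNeed
{
    bool        pending;        /* a feedback request is still owed */
    uint64_t    skipped_upto;   /* highest end LSN of any skipped txn */
} FeedbackNeed;

/* Called when a txn is skipped rather than sent. */
void
note_skipped(FeedbackNeed *fn, uint64_t end_lsn)
{
    fn->pending = true;
    if (end_lsn > fn->skipped_upto)
        fn->skipped_upto = end_lsn;
}

/* Called whenever data bearing an LSN goes out for other reasons. */
void
note_data_sent(FeedbackNeed *fn)
{
    fn->pending = false;
}

/*
 * Called where the sender would otherwise go to sleep.
 * Returns true once per outstanding need, then clears it.
 */
bool
should_request_feedback(FeedbackNeed *fn)
{
    if (!fn->pending)
        return false;

    fn->pending = false;
    return true;
}
```

This only covers the idle case; as noted above, it does nothing for sync rep waiters while other txns keep the sender busy.
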

While we're speaking of adding output plugin hooks, I keep on trying to think of a sensible way to do a plugin-defined reply handler, so the downstream end can send COPY BOTH messages of some new msgkind back to the walsender, which will pass them to the output plugin if it implements the appropriate handle_reply_message (or whatever) callback. That much is trivial to implement. Where I keep getting a bit stuck is whether there's a sensible snapshot that can be set when calling the output plugin's reply handler. We wouldn't want to switch to a current non-historic snapshot because of all the cache flushes that'd cause, but there isn't necessarily a valid and safe historic snapshot to set when we're not within ReorderBufferCommit, is there?

I'd love to get rid of the need to "connect back" to a provider over plain libpq connections to communicate with it. The ability to run SQL on the walsender conn helps. But really, so much more would be possible if we could just have the downstream end *reply* on the same connection using COPY BOTH, much like it sends replay progress updates right now. It'd let us manage relation/attribute/type metadata caches better, for example.


 Craig Ringer         
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
