Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot - Mailing list pgsql-hackers

From shawn wang
Subject Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot
Date
Msg-id CA+T=_GUf17BqsLRUM36c_=h4hOcS9fYDMYWdRZL3ALL_M88GGA@mail.gmail.com
In response to RE: Add logical_decoding_spill_limit to cap spill file disk usage per slot  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
Hi Kuroda,

Thank you for the review and the great questions!


> We have provided the subscription option streaming=parallel since PG16. It
> replicates on-going transactions and applies immediately. Does it avoid the
> issue?

streaming=parallel does significantly reduce publisher-side spill files
in the common case — when enabled, the reorder buffer streams changes
directly instead of spilling to disk.

However, it cannot guarantee 100% avoidance of spilling.  There are
several fallback scenarios in the code where streaming is not possible
and the reorder buffer falls back to spill-to-disk even when
streaming=parallel is configured:

  1. Snapshot not yet consistent (snapbuild.c — SnapBuildCurrentState()
     < SNAPBUILD_CONSISTENT), e.g. right after slot creation.

  2. Transaction is being re-decoded after a restart
     (SnapBuildXactNeedsSkip() returns true).

  3. Transaction contains TOAST partial changes
     (rbtxn_has_partial_change), which cannot be streamed.

  4. Transaction contains speculative inserts (INSERT ... ON CONFLICT),
     also flagged as partial changes.

  5. Transaction has no streamable changes yet
     (!rbtxn_has_streamable_change).

  6. Output plugin does not support streaming callbacks
     (e.g. test_decoding without the streaming option).

  7. Parallel apply worker is busy for >10 seconds — the leader falls
     back to serializing changes to disk
     (applyparallelworker.c, SHM_SEND_TIMEOUT_MS).

  8. No parallel worker available — the leader serializes the entire
     streamed transaction to disk (worker.c,
     get_transaction_apply_action → TRANS_LEADER_SERIALIZE).

Additionally, streaming is a *subscription-level* parameter that only
applies to built-in logical replication.  Users of pg_recvlogical or
third-party CDC tools (Debezium, etc.) consume changes directly from
the publisher's walsender and have no subscription to configure.

So streaming=parallel and logical_decoding_spill_limit are
complementary: streaming reduces spilling in the common case, while
the spill limit provides a hard safety net for the cases where
spilling is unavoidable.


> Not sure, but doesn't it mean the error is repeating till the GUC is increased?

Good question.  Yes, if the same large transaction is re-decoded
without any configuration change, the same ERROR will occur again.
This is intentional — the behavior is analogous to temp_file_limit:
once the limit is hit, the operation fails, and it will keep failing
until the DBA takes action.

The DBA has several options to resolve it:

  - Increase logical_decoding_spill_limit.
  - Increase logical_decoding_work_mem (so less data is spilled).
  - Enable streaming on the subscriber (streaming=on or
    streaming=parallel), which avoids spilling in most cases.
  - Investigate and address the root cause (e.g. break up the
    large transaction).
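As a concrete illustration of the first two options (the values are
arbitrary, and logical_decoding_spill_limit is of course the GUC
proposed in this thread, not an existing one):

```
# postgresql.conf on the publisher
logical_decoding_work_mem   = '256MB'  # buffer more in memory, spill less
logical_decoding_spill_limit = '10GB'  # proposed: hard cap per slot
```

The third option is set on the subscriber side, e.g.
ALTER SUBSCRIPTION mysub SET (streaming = parallel);
where "mysub" is a placeholder subscription name.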

The ERROR message includes the current spill size and the configured
limit, making it straightforward to diagnose.


> Also, is there any difference for the slot's behavior, with the normal walsender's
> exit case?

No, the slot behavior is the same as a normal walsender exit.
Specifically:

  - The slot remains valid (it is NOT invalidated).
  - restart_lsn and confirmed_flush are preserved.
  - The subscriber can reconnect and resume from where it left off.
  - In v2, spill files are properly cleaned up in the error path
    (via WalSndErrorCleanup), so no orphaned files are left behind.

The only difference is that the walsender's exit reason is logged as
an ERROR with ERRCODE_CONFIGURATION_LIMIT_EXCEEDED, rather than a
normal shutdown.  The slot itself is in exactly the same state as if
the walsender had exited normally or the connection was dropped.

Best regards,
Shawn
