Re: Eliminating SPI / SQL from some RI triggers - take 3 - Mailing list pgsql-hackers

From Haibo Yan
Subject Re: Eliminating SPI / SQL from some RI triggers - take 3
Msg-id 2BE661BA-D909-4093-BF78-DB9B0C099337@gmail.com
In response to Re: Eliminating SPI / SQL from some RI triggers - take 3  (Junwang Zhao <zhjwpku@gmail.com>)
List pgsql-hackers
Hi Amit and Junwang,

Thanks for the latest patch. I think the overall direction makes sense, and the single-column SK_SEARCHARRAY path looks like one of the most valuable optimizations here. The patch also seems to cover several important cases, including deferred constraints, duplicate FK values, and multi-column fallback behavior.

After reading through the patch, I have one major comment and a few smaller ones.

1. TopTransactionContext usage during batched flush may be too coarse-grained
My biggest concern is the use of TopTransactionContext around the batched flush path.
As written, ri_FastPathBatchFlush() switches to TopTransactionContext before calling ri_FastPathFlushArray() / ri_FastPathFlushLoop(). That seems broad enough that temporary allocations made during the flush may end up there.
In particular, in ri_FastPathFlushArray(), I think the objects worth checking carefully are the pass-by-reference Datums returned by the per-element cast call and stored in search_vals[], e.g.

```
search_vals[i] = FunctionCall3(&entry->cast_func_finfo, ...);
```
If those cast results are separately allocated in the current memory context, then pfree(arr) only frees the constructed array object itself; it does not obviously free those intermediate cast results. If so, those allocations could survive until end of transaction rather than just until the end of the current flush.
Maybe this is harmless in practice, but I think it needs a closer look. It might be better to use a dedicated short-lived context for per-flush temporary allocations, reset it after each flush, or otherwise separate allocations that really need transaction lifetime from those that are only needed transiently during batched processing.
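Something along these lines is what I have in mind (a rough sketch only; the context name, its placement, and the surrounding code are illustrative, not taken from the patch):

```c
/* Sketch: a dedicated child context for per-flush temporaries */
MemoryContext flush_cxt =
    AllocSetContextCreate(TopTransactionContext,
                          "RI fast-path flush temporaries",
                          ALLOCSET_SMALL_SIZES);

MemoryContext oldcxt = MemoryContextSwitchTo(flush_cxt);
/* ... build search_vals[] (including cast results), run the scan ... */
MemoryContextSwitchTo(oldcxt);

/* Drop cast results and other transient allocations in one go */
MemoryContextReset(flush_cxt);
```

That way, anything that really needs transaction lifetime stays in TopTransactionContext explicitly, and everything else is reclaimed at the end of each flush.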

2. RI_FastPathEntry comment mentions the wrong function name
The comment above RI_FastPathEntry says it contains resources needed by ri_FastPathFlushBatch(), but the function is named ri_FastPathBatchFlush().

3. RI_FASTPATH_BATCH_SIZE needs some rationale
RI_FASTPATH_BATCH_SIZE = 64 may well be a reasonable compromise, but right now it reads like a magic number.
This choice seems especially relevant because the patch has two opposing effects:
3-1. larger batches should amortize the array-scan work better,
3-2. but the matched[] bookkeeping in ri_FastPathFlushArray() is O(batch_size^2) in the worst case.
So I think it would help to include at least a brief rationale in a comment or in the commit message.

4. Commit message says the entry stashes fk_relid, but the code actually stashes riinfo
The commit message says the entry stashes fk_relid and can reopen the relation if needed. Unless I am misreading it, the code actually stores riinfo and later uses riinfo->fk_relid. The distinction is small, but I think the wording should match the implementation more closely.

Thanks again for working on this.

Best regards,

Haibo Yan


On Mar 10, 2026, at 5:28 AM, Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi,

On Mon, Mar 2, 2026 at 11:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:
I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms  (median of 3 runs)
0001: 1382 ms  (43% faster)
0001+0002: 1202 ms  (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms  (48% faster)
0001+0002: 432 ms  (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

Thanks for testing, good to see similar numbers.  I had forgotten to
note that these results are when these PK index probes don't do any
I/O, though you might be aware of that. Below, I report some numbers
that Tomas Vondra shared with me off-list where the probes do have to
perform I/O and there the benefits from only this patch set are only
marginal.

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

Good idea, done.

While polishing 0002, I revisited the snapshot caching semantics. The
previous commit message hand-waved about only curcid changing between
rows, but GetLatestSnapshot() also reflects other backends' commits,
so reusing the snapshot is a deliberate semantic change from the SPI
path. I think it's safe because curcid is all we need for
intra-statement visibility, concurrent commits either already happened
before our snapshot (and are visible) or are racing with our statement
and wouldn't be seen reliably even with per-row snapshots since the
order in which FK rows are checked is nondeterministic, and
LockTupleKeyShare prevents the PK row from disappearing regardless. In
essence, we're treating all the FK checks within a trigger-firing
cycle as a single plan execution that happens to scan N rows, rather
than N independent SPI queries each taking a fresh snapshot. That's
the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
re-take GetLatestSnapshot() between rows either.

Similarly, the permission check (schema USAGE + table SELECT) is now
done once at cache entry creation in ri_FastPathGetEntry() rather than
on every flush.

nice improvement.

The RI check runs as the PK table owner, so we're
verifying that the owner can access their own table -- a condition
that won't change unless someone explicitly revokes from the owner,
which would also break the SPI path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I ended up going ahead with the batching and SAOP idea that David
mentioned -- I had a proof-of-concept working shortly after posting v3
and kept iterating on it.  So attached set is now:

0001 - Core fast path (your 0001+0002 reworked, as before)

0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)

0003 - FK row buffering: materialize FK tuples into a per-constraint
batch buffer (64 rows), flush when full or at batch end

0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
buffered FK values and do one index scan instead of 64 separate tree
descents.  Multi-column FKs fall back to a per-row loop.

0003 is pure infrastructure -- it doesn't improve performance on its
own because the per-row index descent still dominates.  The payoff
comes in 0004.

Numbers (same machine as before, median of 3 runs):

numeric PK / bigint FK, 1M rows:
master: 2487 ms
0001..0004: 1168 ms  (2.1x)

int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms  (3.1x)

The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case.  In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms  (1.08x)
ri-check + i/o prefetching: 50885 ms  (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads.  But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

impressive!


Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.

Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta

Yeah, this is much better than the fpmeta_valid field.

struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.


On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 cache fast-path metadata used by the index probe, at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this.  I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching.  Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path.  Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms  (median of 3 runs)
0001: 1382 ms  (43% faster)
0001+0002: 1202 ms  (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster


Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms  (48% faster)
0001+0002: 432 ms  (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster


The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads;
I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.


I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao



--
Thanks, Amit Langote



--
Regards
Junwang Zhao

I had an offline discussion with Amit today. There were a few small things
that could be improved, so I posted a new version of the patch set.

1.

+ if (ri_fastpath_is_applicable(riinfo))
+ {
+     bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
+
+     if (found)
+         return PointerGetDatum(NULL);
+
+     /*
+      * ri_FastPathCheck opens pk_rel internally; we need it for
+      * ri_ReportViolation.  Re-open briefly.
+      */
+     pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+     ri_ReportViolation(riinfo, pk_rel, fk_rel,
+                        newslot, NULL,
+                        RI_PLAN_CHECK_LOOKUPPK, false, false);
+ }

Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
longer needed, and ri_FastPathCheck now returns void. Since Amit
agreed this is the right approach, I included it directly in v5-0001.

2.

After adding the batch fast path, the original ri_FastPathCheck is only
used by the ALTER TABLE validation path. This path cannot use the
cache because the registered AfterTriggerBatch callback will never run.
Therefore, the use_cache branch can be removed.

I made this change in v5-0004 and also updated some related comments.
Once we agree the changes are correct, it can be merged into v5-0003.

3.

+ fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
+                                    &TTSOpsHeapTuple);

ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks;
it didn't show much improvement. This might be because the slot creation
function is called once per batch rather than once per row, so the overall
impact is minimal. I'm posting this here for Amit to take a look and decide
whether we should adopt it or drop it, since I mentioned the idea to
him earlier.

4.

ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. I asked whether this could be extended to support
multi-column cases, and Amit encouraged me to look into it.

After a brief investigation, it seems that ScanKeyEntryInitialize only allows
passing a single subtype/collation/procedure, which makes it difficult to
handle multiple types. Based on this, my current understanding is that
SK_SEARCHARRAY may not work for multi-column checks.

--
Regards
Junwang Zhao
<v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patch>
<v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch>
<v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patch>
<v5-0004-Refine-fast-path-FK-validation-path.patch>
<v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patch>
<v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patch>
