Re: More efficient RI checks - take 2

From Andres Freund
Subject Re: More efficient RI checks - take 2
Msg-id 20200428222142.vqfetg7m32rxnjtj@alap3.anarazel.de
In response to Re: More efficient RI checks - take 2  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2020-04-28 10:44:58 -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> >> As you say, perhaps there's room for both things, but also as you say,
> >> it's not obvious how to decide intelligently between them.
>
> > The single-row case seems pretty clear and also seems common enough that
> > it'd be worth paying the cost to figure out if it's a single-row
> > statement or not.

It's not that obvious to me that it's going to be beneficial to go down
a planned path in all that many cases. If all that the RI check does is
an index_rescan() && index_getnext_slot(), there aren't that many
realistic types of plans that are going to be better.
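
For reference, the per-row check that ri_triggers.c currently issues
boils down to something like this (a simplified sketch; the real
statement is built dynamically, parameterized, and run through SPI, and
referenced_table / referenced_column are placeholder names):

SELECT 1 FROM ONLY referenced_table x
    WHERE x.referenced_column = $1
    FOR KEY SHARE OF x;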

IIUC a query to check a transition table would, simplified, boil down to
something like:

SELECT * FROM transition_table tt
WHERE
    -- for an insert/update into the referencing table
    (
        NOT EXISTS (SELECT * FROM referenced_table rt WHERE rt.referenced_column = tt.referencing_column)
        [AND ... , ]
    )
    -- for update / delete of referenced table
    OR EXISTS (SELECT * FROM referencing_table rt1 WHERE rt1.referencing_column = tt.referenced_column1)
    [OR ... , ]
LIMIT 1;

A returned row would signal an error. But the query would also need to
handle row locking, CASCADE / SET NULL / SET DEFAULT etc.; more on that
below.
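
Note that CASCADE in particular isn't a check at all, but a DML
statement; for a delete, ri_triggers.c today issues, per referenced row,
something like (again a simplified sketch with placeholder names):

DELETE FROM ONLY referencing_table
    WHERE referencing_column = $1;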

While it's tempting to write the latter check as

-- for update / delete of referenced table
SELECT * FROM referencing_table rt
WHERE referencing_column IN (SELECT referenced_column FROM transition_table tt)
LIMIT 1;

that'd make it harder to identify the violating row.


As the transition table isn't ordered, it seems like there aren't that
many realistic ways to execute this:

1) A nestloop semi/anti-join with an inner index scan
2) Sort the transition table, do a merge semi/anti-join between the
  sort output and an ordered index scan on the referenced / referencing
  table(s).
3) Hash semi/anti-join, requiring a full table scan of the tables


I think 1) would be worse than just doing the indexscan manually. 2)
would probably be beneficial if there are a lot of rows on the inner
side, due to the ordered access and deduplication. 3) would sometimes be
beneficial because it'd avoid an index scan for each tuple in the
transition table.
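
To illustrate, 3) for the insert/update direction could be written as an
anti-join, assuming the transition table were exposed as a plain
relation (just a sketch, ignoring NULLs in the referencing column for
simplicity):

SELECT tt.referencing_column
FROM transition_table tt
    LEFT JOIN referenced_table rt
        ON rt.referenced_column = tt.referencing_column
WHERE rt.referenced_column IS NULL
LIMIT 1;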

The cases in which it is clear to me that a bulk check could
theoretically be significantly better than a fast per-row check are:

1) There's a *lot* of rows in the transition table relative to
  comparatively small referenced / referencing tables. As those tables
  can cheaply be hashed, a hashjoin will be more efficient than doing an
  index lookup for each transition table row.
2) There's a lot of duplicate content in the transition table,
  e.g. because there are a lot of references to the same row.

Did I forget important ones?
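
For case 2), the duplicate elimination could be made explicit by probing
only the distinct keys, roughly like this (just a sketch):

SELECT k.referencing_column
FROM (SELECT DISTINCT referencing_column FROM transition_table) k
WHERE NOT EXISTS (SELECT 1 FROM referenced_table rt
                  WHERE rt.referenced_column = k.referencing_column)
LIMIT 1;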


With regard to the row locking etc. that I elided above: I think that
will actually prevent most, if not all, interesting optimizations.
Because of the FOR KEY SHARE lock that's required, the plan will pretty
much always degrade to a per-row subplan check anyway.  Is there any
formulation of the necessary queries that doesn't suffer from this
problem?
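
To illustrate, adding the lock to the earlier formulation would look
like this; as far as I know the planner won't flatten a sublink that
contains a locking clause into a join, so the EXISTS ends up being
executed as a subplan once per transition table row anyway:

SELECT * FROM transition_table tt
WHERE NOT EXISTS (SELECT 1 FROM referenced_table rt
                  WHERE rt.referenced_column = tt.referencing_column
                  FOR KEY SHARE OF rt)
LIMIT 1;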


> That seems hard to do in advance ... but it would be easy to code
> a statement-level AFTER trigger along the lines of
>
>     if (transition table contains one row)
>         // fast special case here
>     else
>         // slow general case here.

I suspect we'd need something more complicated than this for it to be
beneficial. My gut feeling would be that the case where a transition
table based check would be most commonly beneficial is when you have a
very small referenced table and a *lot* of rows get inserted.  But
clearly we wouldn't want bulk insertion to suddenly also store all
rows in a transition table.

Nor would we want to have a bulk UPDATE cause all the updated rows to be
stored in the transition table, even though none of the relevant columns
changed (i.e. the RI_FKey_[f]pk_upd_check_required logic in
AfterTriggerSaveEvent()).


I still don't quite see how shunting RI checks through triggers saves us
more than it costs:

Especially for the stuff we do as AFTER triggers: most of the time we
could do the work we currently defer until the end of the query
immediately, rather than building up an in-memory queue. Besides saving
memory, in a lot of cases that'd also make it unnecessary to refetch the
row at a later time, possibly having to chase updated row versions.

But even for the BEFORE checks, largely going through generic trigger
code means it's much harder to batch checks without suddenly requiring
memory proportional to the number of inserted rows.


There obviously are cases where it's not possible to check just after
each row. Deferrable constraints, as well as CASCADE / SET NULL /
SET DEFAULT on tables with user-defined triggers, for example. But it'd
likely be sensible to handle that the way we already handle deferred
uniqueness checks, i.e. we only queue something if there's a potential
for a problem.


> I think the question really comes down to this: is the per-row overhead of
> the transition-table mechanism comparable to that of the AFTER trigger
> queue?  Or if not, can we make it so?

It's probably more expensive, in some ways, at the moment. The biggest
difference is that the transition table stores complete rows, valid as
of the moment they were inserted/updated/deleted, whereas the trigger
queue only stores enough information to fetch the tuple again during
trigger execution.  Several RI checks however re-check visibility before
executing, so that's another fetch that'd likely not be elided by a
simple switch to transition tables.

Both have significant downsides, obviously. Storing complete rows can
take a lot more memory, and refetching rows is expensive, especially if
it happens much later (with the row potentially having been pushed out
of shared_buffers by then).


I think it was a mistake to end up with these two different systems in
trigger.c. When transition tables were added we shouldn't have kept
per-tuple state both in the queue and in the transition tuplestore.
Instead we should have used only the tuplestore, and optimized what
information we store in it depending on the needs of the various after
triggers.

Greetings,

Andres Freund


