Robert Haas <robertmhaas@gmail.com> writes:
> Well ... I agree that brittle is bad, but it's not clear to me which
> way is actually less brittle. As for performant, I think you might be
> misjudging the situation. My original design for removing SnapshotNow
> didn't even have the catalog snapshot - it just took a new snapshot
> every time. That was mostly fine, but Andres found a somewhat extreme
> test case where it exhibited a significant regression, so I added the
> catalog snapshot stuff to work around that. So I'm not AT ALL
> convinced that giving catalog snapshots longer lifetimes is a relevant
> thing to do.
Perhaps not. But right now we have to first think about correctness and
how much trouble it will be to get to correctness. ISTM you are arguing
for a design in which it's required that there is always a registered
or active snapshot that's older than the catalog snapshot (if any).
I tried revising snapmgr.c to enforce that, starting with adding
@@ -421,6 +421,13 @@ GetNonHistoricCatalogSnapshot(Oid relid)
if (CatalogSnapshot == NULL)
{
+ /*
+ * The catalog snapshot must always be newer than some active or
+ * registered snapshot. (XXX explain why)
+ */
+ Assert(ActiveSnapshot != NULL ||
+ !pairingheap_is_empty(&RegisteredSnapshots));
+
/* Get new snapshot. */
CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
and this blew up with truly impressive thoroughness. The autovac
launcher, logical replication launcher, and incoming backends all
fail this assertion instantly, making it impossible to find out
what else might be broken --- but I'm sure there is a lot.
(If you want to try this for yourself, remember that the postmaster
will relaunch the AV and LR launchers on failure, meaning that your
disk will fill with core files very very quickly. Just sayin'.)
So maybe we can go that direction, but it's going to require a lot of
code additions to push extra snapshots in places that haven't bothered
to date; and I'm not convinced that that'd be buying us anything
except more GetSnapshotData() calls.
Plan B is to grant catalog snapshots some more-durable status than
what Plan A envisions. I'm not very sure about the details, but if
you don't want to go that route then you'd better set about making
the above assertion work.
In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot
is failing to meet its contract, I'm going to go fix it. I think
(based on the above argument) that what it intends to enforce is not
really the system design we need, but it certainly isn't helping
anyone that it enforces that design incorrectly.
regards, tom lane