On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
> The problem is that there's no protection again the xids in the
> ringbuffer getting old enough to wrap around. Given that practical uses
> of old_snapshot_threshold are likely to be several hours to several
> days, that's not particularly hard to hit.
Presumably this could be fixed by changing it to use FullTransactionId.
> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().
I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.
> As far as I can tell, this code has been wrong since the feature has
> been committed. The tests don't show a problem, because none of this
> code is reached when old_snapshot_threshold = 0 (which has no real world
> use, it's purely for testing).
I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.
> I really don't know what to do here. The feature never worked and will
> silently cause wrong query results. Fixing it seems like a pretty large
> task - there's a lot more bugs. But ripping out a feature in stable
> branches is pretty bad too.
I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.
What exactly is the interaction of this patch with your snapshot
scalability work?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company