
From: Andres Freund
Subject: Re: Improving connection scalability: GetSnapshotData()
Date:
Msg-id: 20200320060045.qljlxm2qx2z54eph@alap3.anarazel.de
In response to: Re: Improving connection scalability: GetSnapshotData() (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

Hi,

Thanks for looking!

On 2020-03-20 18:23:03 +1300, Thomas Munro wrote:
> On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm still working on cleaning that part of the patch up, I'll post it in
> > a bit.
> 
> I looked at that part on your public pgxact-split branch.  In that
> version you used "CSN" rather than something based on LSNs, which I
> assume avoids complications relating to WAL locking or something like
> that.

Right, I first tried to use LSNs, but after further tinkering found that
it's too hard to address the difference between visibility order and LSN
order. I don't think there's an easy way around that difference.


> We should probably be careful to avoid confusion with the
> pre-existing use of the term "commit sequence number" (CommitSeqNo,
> CSN) that appears in predicate.c.

I looked at that after you mentioned it on IM. But I find it hard to
grok how it's precisely defined. There are basically no comments
explaining what it's really supposed to do, and I find the relevant code
far from easy to grok :(.


> This also calls to mind the 2013-2016 work by Ants Aasma and others[1]
> on CSN-based snapshots, which is obviously a much more radical change,
> but really means what it says (commits).

Well, I think you could actually build some form of denser snapshots on
top of "my" CSN, with a bit of effort (and a lot of handwaving). I don't
think they're that different as concepts.


> The CSN in your patch set is used purely as a level-change for
> snapshot cache invalidation IIUC, and it advances also for aborts --
> so maybe it should be called something like completed_xact_count,
> using existing terminology from procarray.c.

I expect it to be used outside of snapshots too, in the future, FWIW.

completed_xact_count sounds good to me.
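
For illustration, roughly the kind of level-change counter I mean - the
field name and the place it's bumped are made up for this sketch, not
what the patch currently does:

    /*
     * Sketch only: advance the shared counter whenever a transaction -
     * committed *or* aborted - stops being "in progress".  Done while
     * holding ProcArrayLock exclusively, so a concurrent
     * GetSnapshotData() either sees the old counter together with the
     * old xid, or the incremented counter without it.
     */
    static void
    AdvanceCompletedXactCount(void)
    {
        Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));

        ShmemVariableCache->completed_xact_count++;
    }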


> +       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
> +               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
> 
> Why is it OK to read ShmemVariableCache->csn without at least a read
> barrier?  I suppose this allows a cached snapshot to be used very soon
> after a transaction commits and should be visible to you, but ...
> hmmmrkwjherkjhg... I guess it must be really hard to observe any
> anomaly.  Let's see... maybe it's possible on a relaxed memory system
> like POWER or ARM, if you use a shm flag to say "hey I just committed
> a transaction", and the other guy sees the flag but can't yet see the
> new CSN, so an SPI query can't see the transaction?

Yea, it does need more thought / comments. I can't really see an actual
correctness violation though. As far as I can tell you'd never be able
to get an "older" ShmemVariableCache->csn than one from *after* the last
lock acquired/released by the current backend - which then also means a
different "ordering" would have been possible, allowing the current
backend to take the snapshot earlier.


> Another theoretical problem is the non-atomic read of a uint64 on some
> 32 bit platforms.

Yea, it probably should be a pg_atomic_uint64 to address that. I don't
think it really would cause problems, because I think it'd always end up
causing an unnecessary snapshot build. But there's no need to go there.
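
I.e. the check would then look roughly like this (just a sketch of the
atomic variant; whether an explicit read barrier is additionally
warranted is the question above):

    /*
     * Sketch: with ShmemVariableCache->csn as a pg_atomic_uint64 the
     * cached-snapshot test can't suffer a torn 64-bit read on 32-bit
     * platforms.  Reading a stale value here at worst means building a
     * new snapshot unnecessarily.
     */
    if (snapshot->csn != 0 &&
        MyProc->xidCopy == InvalidTransactionId &&
        pg_atomic_read_u64(&ShmemVariableCache->csn) == snapshot->csn)
    {
        /* reuse the previously built snapshot */
    }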


> > 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
> >
> >       This is the most crucial piece. Instead of code directly using
> >       RecentOldestXmin, there's a new set of functions for testing
> >       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
> >
> >       Those functions use new horizon boundaries computed as part of
> >       GetSnapshotData(), and recompute an accurate boundary when the
> >       tested xid falls in between.
> >
> >       There's a bit more infrastructure needed - we need to limit how
> >       often an accurate snapshot is computed. Probably to once per
> >       snapshot? Or once per transaction?
> >
> >
> >       To avoid issues with the lower boundary getting too old and
> >       presenting a wraparound danger, I made all the xids be
> >       FullTransactionIds. That imo is a good thing?
> 
> +1, as long as we don't just move the wraparound danger to the places
> where we convert xids to fxids!
> 
> +/*
> + * Be very careful about when to use this function. It can only safely be used
> + * when there is a guarantee that, at the time of the call, xid is within 2
> > + * billion xids of rel. That e.g. can be guaranteed if the caller assures
> + * a snapshot is held by the backend, and xid is from a table (where
> + * vacuum/freezing ensures the xid has to be within that range).
> + */
> +static inline FullTransactionId
> +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> +{
> +       uint32 rel_epoch = EpochFromFullTransactionId(rel);
> +       TransactionId rel_xid = XidFromFullTransactionId(rel);
> +       uint32 epoch;
> +
> +       /*
> +        * TODO: A function to easily write an assertion ensuring that xid is
> +        * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
> +        * other places.
> +        */
> +
> +       if (xid > rel_xid)
> +               epoch = rel_epoch - 1;
> +       else
> +               epoch = rel_epoch;
> +
> +       return FullTransactionIdFromEpochAndXid(epoch, xid);
> +}
> 
> I hate it, but I don't have a better concrete suggestion right now.
> Whatever I come up with amounts to the same thing on some level,
> though I feel like it might be better to used an infrequently updated
> oldestFxid as the lower bound in a conversion.

I am not sure it's as clearly correct to use oldestFxid in as many
cases. Normally PGPROC->xmin (PGXACT->xmin currently) should prevent the
"system wide" xid horizon from moving too far relative to that, but I
think there are more plausible problems with the "oldest" xid horizon
moving concurrently with a backend inspecting values.

It shouldn't be a problem here, since the values are taken under a lock
preventing both from being moved, I think. And since we're only
comparing those two values, without taking anything else into account,
the "global" horizon changing concurrently wouldn't matter.

But it seems easier to understand the correctness when comparing to
nextXid?

What's the benefit of looking at an "infrequently updated" value
instead? I guess you can argue that it'd be more likely to be in cache,
but since all of this lives in a single cacheline...
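
FWIW, the assertion helper from the TODO in the quoted hunk could look
something like this (name invented for the sketch; it assumes the two
boundaries were read consistently, e.g. under the appropriate lock):

    /* needs access/transam.h */
    static inline void
    AssertXidInNormalRange(FullTransactionId oldest_fxid,
                           FullTransactionId next_fxid,
                           TransactionId xid)
    {
        uint32          epoch = EpochFromFullTransactionId(next_fxid);
        TransactionId   next_xid = XidFromFullTransactionId(next_fxid);
        FullTransactionId fxid;

        /* same epoch inference as in FullXidViaRelative() */
        if (xid > next_xid)
            epoch--;
        fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

        /* xid has to fall into [oldestXid, nextFullXid) */
        Assert(!FullTransactionIdPrecedes(fxid, oldest_fxid));
        Assert(FullTransactionIdPrecedes(fxid, next_fxid));
    }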


> An upper bound would also seem better, though requires much trickier
> interlocking.  What you have means "it's near here!"... isn't that too
> prone to bugs that are hidden because of the ambient fuzziness?

I can't follow the last sentence. Could you expand?


> On a more constructive note:
> 
> GetOldestXminInt() does:
> 
>         LWLockAcquire(ProcArrayLock, LW_SHARED);
> 
> +       nextfxid = ShmemVariableCache->nextFullXid;
> +
> ...
>         LWLockRelease(ProcArrayLock);
> ...
> +       return FullXidViaRelative(nextfxid, result);
> 
> But nextFullXid is protected by XidGenLock; maybe that's OK from a
> data freshness point of view (I'm not sure), but from an atomicity
> point of view, you can't do that can you?

Hm. Yea, I think it's not safe against torn 64bit reads, you're right.
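
One way around that would be to go through ReadNextFullTransactionId(),
which takes XidGenLock shared. Roughly (a sketch; whether the ordering
relative to the ProcArrayLock'd section is good enough freshness-wise is
the separate question you raise):

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    /* ... compute result from the proc array ... */
    LWLockRelease(ProcArrayLock);

    /* reads nextFullXid under XidGenLock, so no torn read */
    nextfxid = ReadNextFullTransactionId();

    return FullXidViaRelative(nextfxid, result);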


> >       This patch currently breaks old_snapshot_threshold, as I've not
> >       yet integrated it with the new functions. I think we can make the
> >       old snapshot stuff a lot more precise with this - instead of
> >       always triggering conflicts when a RecentGlobalXmin is too old, we
> >       can do so only in the cases we actually remove a row. I ran out of
> >       energy threading that through the heap_page_prune and
> >       HeapTupleSatisfiesVacuum.
> 
> CCing Kevin as an FYI.

If anybody has an opinion on this sketch I'd be interested. I've started
to implement it, so ...


> > 0008: Move PGXACT->xmin back to PGPROC.
> >
> >       Now that GetSnapshotData() doesn't access xmin anymore, we can
> >       make it a normal field in PGPROC again.
> >
> >
> > 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
> > 0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
> > 0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
> >
> >       These successively move the remaining PGXACT fields into separate
> >       arrays in ProcGlobal, and adjust GetSnapshotData() to take
> >       advantage.  Those arrays are dense in the sense that they only
> >       contain data for PGPROCs that are in use (i.e. when disconnecting,
> >       the array is moved around).
> >
> >       I think xid, and vacuumFlags are pretty reasonable. But need
> >       cleanup, obviously:
> >       - The biggest cleanup would be to add a few helper functions for
> >         accessing the values, rather than open coding that.
> >       - Perhaps we should call the ProcGlobal ones 'cached', and name
> >         the PGPROC ones as the one true source of truth?
> >
> >       For subxid I thought it'd be nice to have nxids and overflow be
> >       only one number. But that probably was the wrong call? Now
> >       TransactionIdInProgress() cannot look at the subxids that did
> >       fit in PGPROC.subxid. I'm not sure that's important, given the
> >       likelihood of misses?  But I'd probably still have the subxid
> >       array be one of {uint8 nsubxids; bool overflowed} instead.
> >
> >
> >       To keep the arrays dense they copy the logic for pgprocnos. Which
> >       means that ProcArrayAdd/Remove move things around. Unfortunately
> >       that requires holding both ProcArrayLock and XidGenLock currently
> >       (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
> >       that doesn't seem too bad?
> 
> In the places where you now acquire both, I guess you also need to
> release both in the error path?

Hm. I guess you mean:

    if (arrayP->numProcs >= arrayP->maxProcs)
    {
        /*
         * Oops, no room.  (This really shouldn't happen, since there is a
         * fixed supply of PGPROC structs too, and so we should have failed
         * earlier.)
         */
        LWLockRelease(ProcArrayLock);
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("sorry, too many clients already")));
    }

I think we should just remove the LWLockRelease? At this point we
already have set up ProcKill(), which would release all lwlocks after
the error was thrown?
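
I.e. something like this (a sketch; it relies on ProcKill()'s
LWLockReleaseAll() releasing both ProcArrayLock and XidGenLock during
FATAL processing):

    if (arrayP->numProcs >= arrayP->maxProcs)
    {
        /*
         * Oops, no room.  (This really shouldn't happen, since there is a
         * fixed supply of PGPROC structs too, and so we should have failed
         * earlier.)
         *
         * No explicit LWLockRelease here: ProcKill() is already set up at
         * this point and releases all held lwlocks on the way out, so both
         * ProcArrayLock and XidGenLock get cleaned up when the FATAL error
         * is thrown.
         */
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("sorry, too many clients already")));
    }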

Greetings,

Andres Freund


