Hello, Andres.
Thanks for the feedback again.
> The problem is that we don't want to add a lot of work
> KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
> for that moment and building a sorted, gap-free, linear array of xids isn't
> cheap. In my experience it's more common to be bottlenecked by replay CPU
> performance than on replica snapshot building.
Yes, but my patch adds almost the smallest possible amount for
KnownAssignedXidsAdd/Remove - a single write to the array by index.
It differs from the first version in this thread which is based on linked lists.
The "next valid offset" is just "optimistic optimization" - it means
"you could safely skip KnownAssignedXidsNext[i] while finding the next
valid".
But KnownAssignedXidsNext is not updated by Add/Remove.
> During recovery, where there's only one writer to the procarray / known xids,
> it might not be hard to opportunistically build a dense version of the known
> xids whenever a snapshot is built.
AFAIU the patch does exactly the same.
On the first snapshot building, offsets to the next valid entry are
set. So, a dense version is created on-demand.
And this version is reused (even partly if something was removed) on
the next snapshot building.
> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:
We could try to use some other structure (for example - linked hash
map) - but the additional cost (memory management, links, hash
calculation) will probably significantly reduce performance.
And it is a much harder step to perform.
So, I think "next valid offset" optimization is a good trade-off for now:
* KnownAssignedXidsAdd/Remove are almost not affected in their complexity
* KnownAssignedXidsGetAndSetXmin is eliminated from the CPU top on
typical read scenario - so, more TPS, less ProcArrayLock contention
* it complements GetSnapshotData() scalability - now on standby
* changes are small
Thanks,
Michail.