Re: Slow standby snapshot - Mailing list pgsql-hackers

From: Andrey Borodin
Subject: Re: Slow standby snapshot
Msg-id: 5FDD25C2-7806-490A-8D28-483F1A70D8D5@yandex-team.ru
In response to: Re: Slow standby snapshot (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Sorry for the late reply; I've been thinking about possible approaches.
In fact, KnownAssignedXids over a hashtable was implemented long ago and rejected [0].

> On 3 Aug 2021, at 22:35, Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
>>> On 3 Aug 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
>>> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
>>>> The main idea is simple optimistic optimization - store offset to next
>>>> valid entry. So, in most cases, we could just skip all the gaps.
>>>> Of course, it adds some additional impact for workloads without long
>>>> (few seconds) transactions but it is almost not detectable (because of
>>>> CPU caches).
>>>
>>> I'm doubtful that that's really the right direction. For workloads that
>>> are replay heavy we already often can see the cost of maintaining the
>>> known xids datastructures show up significantly - not surprising, given
>>> the datastructure. And for standby workloads with active primaries the
>>> cost of searching through the array in all backends is noticeable as
>>> well.  I think this needs a bigger data structure redesign.
>>
>> KnownAssignedXids implements simple membership test idea. What kind of
>> redesign would you suggest? Proposed optimisation makes it close to optimal,
>> but needs eventual compression.
>
> Binary searches are very inefficient on modern CPUs (unpredictable memory
> accesses, unpredictable branches). We constantly need to do binary searches
> during replay to remove xids from the array. I don't see how you can address
> that with just the current datastructure.
The current patch addresses another problem: in the presence of an old enough transaction, enumeration of KnownAssignedXids under shared lock prevents adding new transactions under exclusive lock, and recovery effectively pauses.

Binary searches can cost 10-15 cache misses, which is an unreasonable amount of memory waits, but that is a somewhat different problem.
Also, binary search is not that expensive when we compress KnownAssignedXids often.
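To make this concrete, here is a rough sketch of the skip-offset idea from the proposed patch as I understand it (the loop and the variable names below are my own illustration, not the actual patch code):

    /* Illustration only: assume a parallel array KnownAssignedXidsNext[]
     * where each slot stores the distance to the next possibly-valid slot,
     * so scans jump over runs of invalidated entries instead of visiting
     * them one by one. */
    for (int i = tail; i < head; i += KnownAssignedXidsNext[i])
    {
        TransactionId xid = KnownAssignedXids[i];

        if (!TransactionIdIsValid(xid))
            continue;           /* the stored offset already skips the gap */

        /* ... add xid to the snapshot ... */
    }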

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding\removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
>
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking about an inefficient KnownAssignedXidsRemovePreceding() over a hashtable. But, probably, this is not such a frequent operation.
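What I had in mind is roughly the following (purely illustrative dynahash usage; the table and entry names are hypothetical): removing everything preceding a given xid has to visit every entry, because a hash table keeps no xid ordering.

    /* Hypothetical sketch, not existing code: KnownAssignedXidsRemovePreceding()
     * over a hash table must scan the whole table. */
    HASH_SEQ_STATUS scan;
    KnownXidEntry *entry;           /* hypothetical entry type keyed by xid */

    hash_seq_init(&scan, knownAssignedXidsHash);
    while ((entry = (KnownXidEntry *) hash_seq_search(&scan)) != NULL)
    {
        if (TransactionIdPrecedes(entry->xid, removeXid))
            (void) hash_search(knownAssignedXidsHash, &entry->xid,
                               HASH_REMOVE, NULL);
    }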

>> Maybe use a tree? (AVL\RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
>
> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:
>
> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent. But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

I've been prototyping a radix tree for a while.
Here every 4 xids are summarized by a minimum Xid and the number of underlying Xids. Of course, 4 is an arbitrary number; the summarization area must be of cache line size.
┌───────┐
│ 1 / 9 │
├───────┴────┐
│            └────┐
│                 └────┐
│                      └────┐
▼                           └───▶
┌───────────────────────────────┐
│ 1 / 3 | 5 / 0 | 9 / 3 | D / 3 │
├───────┬───────┬────────┬──────┴────┐
│       └─┐     └───┐    └────┐      └─────┐
│         └─┐       └──┐      └────┐       └─────┐
│           └─┐        └──┐        └────┐        └────┐
▼             └─┐         └──┐          └───┐         └────┐
┌───────────────▼────────────┴─▶────────────┴──▶───────────┴───▶
│ 1 | 2 |   | 4 |   |   |   |   | 9 |   | B | C | D | E | F |  │
└──────────────────────────────────────────────────────────────┘
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an Xid we touch close to the theoretical minimum of cache lines: I'd say 5-7 instead of the 10-15 of binary search (in the case of millions of entries in KnownAssignedXids).
Enumerating running Xids is not that difficult either: we need to scan O(xip) memory, not the whole KnownAssignedXids array.
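A rough sketch of one summarization slot (the layout and names are mine, illustrating the prototype idea above, not committed code):

    /* One summary slot covers a fixed fan-out of lower-level slots and
     * stores the minimum valid xid in that range plus how many valid
     * xids lie below it (the "min / count" pairs in the picture above). */
    #define KAX_FANOUT 4            /* arbitrary; should match a cache line */

    typedef struct KAXSummary
    {
        TransactionId min_xid;      /* smallest valid xid in the covered range */
        uint32        nvalid;       /* number of valid xids in the covered range */
    } KAXSummary;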

But the overall complexity of this approach does not seem good to me.

All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.

Currently we do not compress the array when
        if (nelements < 4 * PROCARRAY_MAXPROCS ||        // it's not that big yet, OR
            nelements < 2 * pArray->numKnownAssignedXids) // less than half of it is bloat
            return;
From my POV the arbitrary number 4 is just too high.
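For illustration (the numbers here are hypothetical): with PROCARRAY_MAXPROCS around 100, the first condition lets the array grow to roughly 400 slots before compression is even considered, no matter how few of those slots still hold valid xids; a smaller multiplier (or a different trigger) would keep scans short at the cost of more frequent compression work.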

Summary: I think the "KnownAssignedXidsNext" patch plus compressing KnownAssignedXids more often is better than a major KnownAssignedXids redesign.


Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4

