Re: RFC: Improve CPU cache locality of syscache searches - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: RFC: Improve CPU cache locality of syscache searches
Date
Msg-id 95b6948cdd258c73a5f7ae4109b43706@postgrespro.ru
Whole thread Raw
In response to Re: RFC: Improve CPU cache locality of syscache searches  (Andres Freund <andres@anarazel.de>)
Responses Re: RFC: Improve CPU cache locality of syscache searches  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund писал 2021-08-05 23:12:
> Hi,
> 
> On 2021-08-05 12:27:49 -0400, John Naylor wrote:
>> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> 
>> wrote:
>> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
>> > > typedef struct cc_bucket
>> > > {
>> > >   uint32 hashes[4];
>> > >   catctup *ct[4];
>> > >   dlist_head;
>> > > };
>> >
>> > I'm not convinced that the above the right idea though. Even if the hash
>> > matches, you're still going to need to fetch at least catctup->keys[0]
>> from
>> > a separate cacheline to be able to return the cache entry.
>> 
>> I see your point. It doesn't make sense to inline only part of the
>> information needed.
> 
> At least not for the unconditionally needed information.
> 
> 
>> Although I'm guessing inlining just two values in the 4-key case 
>> wouldn't
>> buy much.
> 
> Not so sure about that. I'd guess that two key comparisons take more 
> cycles
> than a cacheline fetch the further keys (perhaps not if we had inlined 
> key
> comparisons). I.e. I'd expect out-of-order + speculative execution to 
> hide the
> latency for fetching the second cacheline for later key values.
> 
> 
>> > If we stuffed four values into one bucket we could potentially SIMD the
>> hash
>> > and Datum comparisons ;)
>> 
>> ;-) That's an interesting future direction to consider when we support
>> building with x86-64-v2. It'd be pretty easy to compare a vector of 
>> hashes
>> and quickly get the array index for the key comparisons (ignoring for 
>> the
>> moment how to handle the rare case of multiple identical hashes).
>> However, we currently don't memcmp() the Datums and instead call an
>> "eqfast" function, so I don't see how that part  would work in a 
>> vector
>> setting.
> 
> It definitely couldn't work unconditionally - we have to deal with 
> text,
> oidvector, comparisons after all. But we could use it for the other
> types. However, looking at the syscaches, I think it'd not very often 
> be
> applicable for caches with enough columns.
> 
> 
> I have wondered before whether we should have syscache definitions 
> generate
> code specific to each syscache definition. I do think that'd give a 
> good bit
> of performance boost. But I don't see a trivial way to get there 
> without
> notational overhead.
> 
> We could define syscaches in a header using a macro that's defined 
> differently
> in syscache.c than everywhere else. The header would declare a set of
> functions for each syscache, syscache.c would define them to call an
> always_inline function with the relevant constants.
> 
> Or perhaps we should move syscache definitions into the pg_*.h headers, 
> and
> generate the relevant code as part of their processing. That seems like 
> it
> could be nice from a modularity POV alone. And it could do better than 
> the
> current approach, because we could hardcode the types for columns in 
> the
> syscache definition without increasing notational overhead.

Why don't use simplehash or something like that? Open-addressing schemes
show superior cache locality.

It could be combined: use hashValue as a key in simplehash and dlist for
hashValue collision handling. 99.99% of time there will be no collisions
on hashValue itself, therefore it will be almost always two-three memory
lookups: one-two for dlist_head in simple_hash by hashValue and then
fetching first element in dlist.

And code will remain almost same. Just "bucket" search will change a 
bit.

(And I'd recommend use lower fill factor for this simplehash. At most 
0.85).

Well, simplehash entry will be 24 bytes this way. If simplehash template
supports external key/element storage, then it could be shrunk to 16 
bytes,
and syscache entries will not need dlist_node. (But it doesn't at the
moment).

And custom open-addressing table could be even with 8 bytes per element:
- element is a pointer to entry,
- missing node is encoded as NULL,
- with fill factor as low as 0.66 there will be small amount of 
collisions,
   therefore non-empty entry will be matched entry most of time, and 
memory
   lookup for key comparison will be amortized free.
Note that 8byte entry with fill factor 0.66 consumes amortized 12.12 
byte,
while 16byte entry with fill factor 0.99 consumes amortized 16.16byte.

regards,
Yura Sokolov

y.sokolov@postgrespro.ru
funny.falcon@gmail.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?
Next
From: Andres Freund
Date:
Subject: Re: RFC: Improve CPU cache locality of syscache searches