
From John Naylor
Subject RFC: Improve CPU cache locality of syscache searches
Date
Msg-id CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
List pgsql-hackers
CPU caches have multiple levels, so I had an idea to use that concept in the syscaches.

Imagine if catcache buckets were cacheline-sized. In that case, we would store the most recently accessed hash values and pointers to catctup, in addition to the dlist_head:

typedef struct cc_bucket
{
  uint32 hashes[4];
  CatCTup *ct[4];
  dlist_head list;
} cc_bucket;

You can think of the bucket here as a 4-way set associative L1 and the list walk as an inclusive L2. There might be an existing term for this scheme, but I don't know what it is offhand.

Creating entries:

Instead of calling dlist_push_head(), we call dlist_push_tail() and then stash the entry in the L1 so it's still quickly available if it's accessed in the near future. This way, older entries evicted from L1 will tend to remain toward the front of the list. Walking the list will always go from oldest to newest, which might have better prefetch behavior (not sure).

Search:

Buckets with <= 4 entries would only need to access a single cacheline to get the catctup pointer with the matching hash value. If we have to walk the list to find a match, we stash it in the L1, which is faster than calling dlist_move_head().

L1 Eviction:

When putting an entry here, we memmove() everything down in each array and place the pointer and the hash value in the front, evicting the fourth (possibly NULL) entry.

The buckets would now be 4 times larger on 64-bit machines, but that might not be a problem if we just divide the initial number of buckets by four as well. If the minimum nbuckets is 2, then the smaller caches would waste a bit of space, but it doesn't seem too bad. As far as when to rehash the bucket array, the fill factor would logically jump from 2 to 8. The worst-case list walk would be longer, but with a better overall memory access pattern, it might be fine.

I think it would be straightforward to code this up -- the difficulty is testing and accounting for random effects due to binary layout changes. Thoughts?

--
