Re: optimize lookups in snapshot [sub]xip arrays - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: optimize lookups in snapshot [sub]xip arrays
Msg-id 20220717035957.GA3631071@nathanxps13
In response to Re: optimize lookups in snapshot [sub]xip arrays  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi Andres,

Thanks for taking a look.

On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> Hm. Are there any contexts where we'd not want the potential for failing due
> to OOM?

I'm not sure about this one.

> I wonder if we additionally / alternatively could use a faster method of
> searching the array linearly, e.g. using SIMD.

I looked into using SIMD.  The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review.  There may be a simpler/better way to implement the linear search,
but this seemed to work well.  Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning.  Here are the numbers:

    writers    head    simd    hash
    256         663     632     694
    512         530     618     637
    768         489     544     573
    1024        364     508     562
    2048        185     306     485
    4096        146     197     441

While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering.  For one, there's no setup cost (i.e., you don't need to
sort the array or build a hash table beforehand), so we can avoid picking
an arbitrary threshold and just have one code path.  Also, a SIMD
implementation of a linear search through an array of integers could
likely be reused easily elsewhere.
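For illustration, here is a minimal sketch of what such a SIMD linear search
could look like using SSE2 intrinsics.  The function and variable names are
hypothetical and not taken from the attached patch; this just shows the
general shape (compare four 32-bit values per iteration, fall back to a
scalar loop for the tail):

```c
/* Hypothetical sketch of a SIMD linear search, assuming x86-64 with
 * SSE2.  Names are illustrative, not from the attached patch. */
#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool
simd_search_uint32(const uint32_t *base, size_t nelem, uint32_t key)
{
    size_t      i = 0;
    const __m128i keys = _mm_set1_epi32((int) key);   /* broadcast key */

    /* Compare four elements per iteration. */
    for (; i + 4 <= nelem; i += 4)
    {
        __m128i     chunk = _mm_loadu_si128((const __m128i *) &base[i]);
        __m128i     cmp = _mm_cmpeq_epi32(chunk, keys);

        /* any lane equal -> some byte of the mask is set */
        if (_mm_movemask_epi8(cmp) != 0)
            return true;
    }

    /* Scalar tail for the remaining elements. */
    for (; i < nelem; i++)
    {
        if (base[i] == key)
            return true;
    }

    return false;
}
```

Note that the array needn't be sorted, which is what makes the
no-threshold, single-code-path property possible.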

> Another thing that might be worth looking into is to sort the xip/subxip
> arrays into a binary-search optimized layout. That'd make the binary search
> faster, wouldn't require additional memory (a boolean indicating whether
> sorted somewhere, I guess), and would easily persist across copies of the
> snapshot.

I spent some time looking into this, but I haven't attempted to implement
it.  IIUC the most difficult part of this is sorting the array in place to
the special layout.
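For reference, the layout in question is presumably something like the
Eytzinger (breadth-first) order.  Here's a sketch that sidesteps the
in-place reordering problem by building the layout from a sorted array
into scratch space, plus the corresponding search; all names are
illustrative:

```c
/* Illustrative sketch of a binary-search-optimized (Eytzinger/BFS)
 * layout.  Building it in place, without scratch space, is the hard
 * part mentioned above; this version uses a separate output array. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Recursively fill eytz[] in BFS order via an in-order walk of the
 * implicit tree, consuming sorted[] left to right.  *in tracks the
 * next unconsumed sorted element; out is the current tree slot.
 */
static void
eytzinger_build(const uint32_t *sorted, uint32_t *eytz,
                size_t nelem, size_t *in, size_t out)
{
    if (out >= nelem)
        return;
    eytzinger_build(sorted, eytz, nelem, in, 2 * out + 1);  /* left */
    eytz[out] = sorted[(*in)++];
    eytzinger_build(sorted, eytz, nelem, in, 2 * out + 2);  /* right */
}

/*
 * Search the BFS layout: children of slot i live at 2i+1 and 2i+2,
 * so the traversal touches contiguous, prefetch-friendly prefixes
 * of the array instead of jumping around like classic binary search.
 */
static bool
eytzinger_search(const uint32_t *eytz, size_t nelem, uint32_t key)
{
    size_t      i = 0;

    while (i < nelem)
    {
        if (eytz[i] == key)
            return true;
        i = 2 * i + (eytz[i] < key ? 2 : 1);
    }
    return false;
}
```

The appeal Andres notes holds here: the reordering costs no extra memory
beyond a "sorted" flag, and the layout survives snapshot copies for free.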

>> These hash tables are regarded as ephemeral; they only live in
>> process-local memory and are never rewritten, copied, or
>> serialized.
> 
> What does rewriting refer to here?

I mean that a hash table created for one snapshot will not be cleared and
reused for another.

> Not convinced that's the right idea in case of copying. I think we often end
> up copying snapshots frequently, and building & allocating the hashed xids
> separately every time seems not great.

Right.  My concern with reusing the hash tables is that we'd need to
allocate much more space that would go largely unused in many cases.

>> +    snapshot->xiph = NULL;
>> +    snapshot->subxiph = NULL;
> 
> Do we need separate hashes for these? ISTM that if we're overflowed then we
> don't need ->subxip[h], and if not, then the action for an xid being in ->xip
> and ->subxiph is the same?

Do you mean that we can combine these into one hash table?  That might
work.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
