Re: Do we want a hashset type? - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Do we want a hashset type?
Date
Msg-id 64d3fe63-86ff-45c3-956a-17b2645167e2@app.fastmail.com
Whole thread Raw
In response to Re: Do we want a hashset type?  (jian he <jian.universality@gmail.com>)
Responses Re: Do we want a hashset type?
List pgsql-hackers
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
> similar to (int[] || int4) and (int4 || int[])
> should we expect ('{1,2}'::int4hashset || 3) == (3 ||  
> '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3)); 
> *?*

Good idea, makes sense to support it.
Implemented in attached patch.

> CREATE OPERATOR || (
>     leftarg = int4,
>     rightarg = int4hashset,
>     function = int4_add_int4hashset,
>     commutator = ||
> );
> while creating an operator. I am not sure how to specify 
> NEGATOR,RESTRICT,JOIN clause.

I don't think we need those for this operator, might be wrong though.

>
-----------------------------------------------------------------------------------------------------------------------------
> also. I think the following query should return one row only? but now 
> it doesn't.
> select hashset_cmp('{1,2}','{2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2}');

Good point.

I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:

SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
 ?column?
----------
 t
(1 row)

SELECT hashset_hash('{1,2}'::int4hashset);
 hashset_hash
--------------
    990882385
(1 row)

SELECT hashset_hash('{2,1}'::int4hashset);
 hashset_hash
--------------
    996377797
(1 row)

Do we have any ideas on how to fix this without sacrificing performance?

We of course want to avoid having to sort the hashsets,
which is the naive solution.

To understand why this is happening, consider this example:

SELECT '{1,2}'::int4hashset;
 int4hashset
-------------
 {1,2}
(1 row)

SELECT '{2,1}'::int4hashset;
 int4hashset
-------------
 {2,1}
(1 row)

If the hash of `1` and `2` modulus the capacity results in the same value,
they will be attempted to be inserted into the same position,
and since the input text is parsed left-to-right, in the first case `1` will win
the first position, and `2` will get a collision and try the next position.

In the second case, the opposite happens.

Since we do modulus capacity, the position depends on the capacity,
which is why the output string can be different for the same input.

SELECTint4hashset() || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=1) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=2) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=3) || 1 || 2 || 3;
 {3,2,1}

SELECTint4hashset(capacity:=4) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=5) || 1 || 2 || 3;
 {1,2,3}

SELECTint4hashset(capacity:=6) || 1 || 2 || 3;
 {1,3,2}


>
----------------------------------------------------------------------------------------------------------------------
> similar to elem_contained_by_range, range_contains_elem. we should 
> already consider the operator *<@* and @*>? *

That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits than just using the function hashset_contains(int4hashset, integer) directly?

/Joel
Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Next
From: "Tristan Partin"
Date:
Subject: Re: Support to define custom wait events for extensions