Re: Do we want a hashset type? - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Do we want a hashset type?
Msg-id 2040c023-1a52-4366-9716-8c8507bb6e32@app.fastmail.com
In response to Re: Do we want a hashset type?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
> On 6/10/23 17:46, Andrew Dunstan wrote:
>> 
>> Maybe you can post a full patch as well as incremental?
>> 
>
> I wonder if we should keep discussing this extension here, considering
> it's going to be out of core (at least for now). Not sure how many
> pgsql-hackers are interested in this, so maybe we should just move it to
> github PRs or something ...

I think there are some good arguments in favour of including it in core:

1. It's a fundamental data structure. Perhaps "set" would have been a better name,
since the use of hash functions is, from an end-user perspective, an implementation
detail, but we cannot use that word since it's a reserved keyword, hence "hashset".

2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future 
SQL/PGQ implementation. If low-level functionality is useful on its own, I think there is
a benefit in exposing it to users, since it allows system testing of the component
in isolation, even if it's primarily going to be used as a smaller part of a larger,
higher-level component.

3. I think there is a general need for hashset, experienced by myself, Andrew and,
I would guess, lots of other users. The general pattern that will be improved is
wherever you would currently do array_agg(DISTINCT ...). There are
probably other situations too, since it's a fundamental data structure.
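
To make the pattern in point 3 a bit more concrete, here is a rough sketch in
Python rather than SQL or C. It is only an illustration, not code from the
extension: the edge list and both function names are made up, and the
sort-then-deduplicate variant is my assumption about how a DISTINCT aggregate
typically behaves, not a description of the actual executor code.

```python
from collections import defaultdict

# Hypothetical edge list of (node_id, neighbour_id); duplicates are deliberate.
edges = [(1, 2), (1, 3), (1, 2), (2, 3), (2, 3), (3, 1)]


def distinct_via_sort(edges):
    """Roughly the array_agg(DISTINCT ...) pattern: collect, sort, drop repeats."""
    grouped = defaultdict(list)
    for src, dst in edges:
        grouped[src].append(dst)
    result = {}
    for src, values in grouped.items():
        values.sort()
        deduped = []
        for v in values:
            # After sorting, duplicates are adjacent, so compare with the last kept value.
            if not deduped or deduped[-1] != v:
                deduped.append(v)
        result[src] = deduped
    return result


def distinct_via_hashset(edges):
    """The same grouping, but accumulating directly into a hash-based set."""
    grouped = defaultdict(set)
    for src, dst in edges:
        grouped[src].add(dst)  # duplicates are discarded on insert
    return dict(grouped)


by_sort = distinct_via_sort(edges)
by_set = distinct_via_hashset(edges)
```

Both variants yield the same distinct neighbours per node; the set-based one
skips the sort and also gives cheap membership tests afterwards, which is the
kind of operation graph traversal tends to need.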

On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 3) support for other types (now it only works with int32)
> I think we should decide what types we want/need to support, and add one
> or two types early. Otherwise we'll have code / on-disk format making
> various assumptions about the type length etc.
>
> I have no idea what types people use as node IDs - is it likely we'll
> need to support types passed by reference / varlena types? Or can we
> just assume it's int/bigint?

I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".

/Joel


