Home > mailing lists

Re: Do we want a hashset type? - Mailing list pgsql-hackers

From	jian he
Subject	Re: Do we want a hashset type?
Date	June 29, 2023 06:54:44
Msg-id	CACJufxFxpMYEkMpoRrYP7YMTp5TkyzWq7piYE5gYB6sCr2mW1g@mail.gmail.com Whole thread
In response to	Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>)
Responses	Re: Do we want a hashset type?
List	pgsql-hackers

Tree view

On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Jun 28, 2023, at 08:26, jian he wrote:
>
> > Hi there.
> > I changed the function hashset_contains to strict.
>
> Changing hashset_contains to STRICT would cause it to return NULL
> if any of the operands are NULL, which I don't believe is correct, since:
>
> SELECT NULL = ANY('{}'::int4[]);
>  ?column?
> ----------
>  f
> (1 row)
>
> Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
> to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.
>
> Did you try running `make installcheck` after your change?
> You would then have seen one of the tests failing:
>
> test array-and-multiset-semantics ... FAILED       21 ms
>
> Check the content of `regression.diffs` to see why:
>
> % cat regression.diffs
> diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out
/Users/joel/src/hashset/results/array-and-multiset-semantics.out
> --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out      2023-06-27 10:07:38
> +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out    2023-06-28 10:13:27
> @@ -158,7 +158,7 @@
>          |      | {NULL}      | {NULL}       |                  |
>          |    1 | {1}         | {1}          |                  |
>          |    4 | {4}         | {4}          |                  |
> - {}     |      | {NULL}      | {NULL}       | f                | f
> + {}     |      | {NULL}      | {NULL}       |                  | f
>   {}     |    1 | {1}         | {1}          | f                | f
>   {}     |    4 | {4}         | {4}          | f                | f
>   {NULL} |      | {NULL}      | {NULL,NULL}  |                  |
> @@ -284,7 +284,8 @@
>      "= ANY(...)";
>   arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
>  ------+------+-------------+--------------+------------------+------------
> -(0 rows)
> + {}   |      | {NULL}      | {NULL}       |                  | f
> +(1 row)
>
>
> > also change the way to return an empty array.
>
> Nice.
> I agree the `Datum d` variable was unnecessary.
> I also removed the unused includes.
>
> > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> > int4hashset can speed distinct aggregate and distinct counts?
> > like the following:
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
>
> The 100k tables seems to be too small to give any meaningful results,
> when trying to measure individual queries:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
>  Execution Time: 26.790 ms
>  Execution Time: 30.616 ms
>  Execution Time: 33.253 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k;
>  Execution Time: 32.797 ms
>  Execution Time: 27.605 ms
>  Execution Time: 26.228 ms
>
> If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
> is actually faster for the `i` column where all input integers are unique:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
>  Execution Time: 799.017 ms
>  Execution Time: 796.008 ms
>  Execution Time: 799.121 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_10M;
>  Execution Time: 1204.873 ms
>  Execution Time: 1221.822 ms
>  Execution Time: 1216.340 ms
>
> For random integers, hashset is a win though:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
>  Execution Time: 1874.722 ms
>  Execution Time: 1878.760 ms
>  Execution Time: 1861.640 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(rnd) FROM benchmark_input_10M;
>  Execution Time: 1253.709 ms
>  Execution Time: 1222.651 ms
>  Execution Time: 1237.849 ms
>
> > explain(costs off,timing off, analyze,buffers)
> > select count(distinct rnd) from benchmark_input_100k \watch c=3
> >
> > explain(costs off,timing off, analyze,buffers)
> > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> > benchmark_input_100k) sub(x) \watch c=3
>
> I tried these with 10M:
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
>  Execution Time: 1733.320 ms
>  Execution Time: 1725.214 ms
>  Execution Time: 1716.636 ms
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
>  Execution Time: 1249.612 ms
>  Execution Time: 1240.558 ms
>  Execution Time: 1252.103 ms
>
> Not sure what I think of the current benchmark suite.
>
> I think it would be better to only include some realistic examples from
> real-life, such as the graph query which was the reason I personally started
> working on this. Otherwise there is a risk we optimise for some hypothetical
> scenario that is not relevant in practise.
>
> Would be good with more examples of typical work loads for when the hashset
> type would be useful.
>
> /Joel

> Did you try running `make installcheck` after your change?

First I use make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
I found out it uses another active cluster.
So I killed another active cluster.
later i found another database port so I took me sometime to found out
I need use following:
make installcheck
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config
PGPORT=5421

Anyway, this time, I added another macro,which seems to simplify the code.

#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

it passed all the tests on my local machine.
I should have only made a patch, but when I was committed, I forgot to
mention one file, so I needed 2 commits.


> Not sure what I think of the current benchmark suite.
your result is so different from mine. I use the default  config. I
see a big difference. yech, I agree, the performance test should be
more careful.

Attachment

pgsql-hackers by date:

From: Richard Guo
Date: 29 June 2023, 06:44:43
Subject: Re: Assert !bms_overlap(joinrel->relids, required_outer)

From: Heikki Linnakangas
Date: 29 June 2023, 07:06:35
Subject: Re: Add more sanity checks around callers of changeDependencyFor()

Re: Do we want a hashset type? - Mailing list pgsql-hackers

Attachment

Previous

Next