Re: Do we want a hashset type? - Mailing list pgsql-hackers
From | jian he |
---|---|
Subject | Re: Do we want a hashset type? |
Date | |
Msg-id | CACJufxFxpMYEkMpoRrYP7YMTp5TkyzWq7piYE5gYB6sCr2mW1g@mail.gmail.com Whole thread Raw |
In response to | Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>) |
Responses |
Re: Do we want a hashset type?
|
List | pgsql-hackers |
On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <joel@compiler.org> wrote: > > On Wed, Jun 28, 2023, at 08:26, jian he wrote: > > > Hi there. > > I changed the function hashset_contains to strict. > > Changing hashset_contains to STRICT would cause it to return NULL > if any of the operands are NULL, which I don't believe is correct, since: > > SELECT NULL = ANY('{}'::int4[]); > ?column? > ---------- > f > (1 row) > > Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE, > to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023. > > Did you try running `make installcheck` after your change? > You would then have seen one of the tests failing: > > test array-and-multiset-semantics ... FAILED 21 ms > > Check the content of `regression.diffs` to see why: > > % cat regression.diffs > diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out > --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38 > +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27 > @@ -158,7 +158,7 @@ > | | {NULL} | {NULL} | | > | 1 | {1} | {1} | | > | 4 | {4} | {4} | | > - {} | | {NULL} | {NULL} | f | f > + {} | | {NULL} | {NULL} | | f > {} | 1 | {1} | {1} | f | f > {} | 4 | {4} | {4} | f | f > {NULL} | | {NULL} | {NULL,NULL} | | > @@ -284,7 +284,8 @@ > "= ANY(...)"; > arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...) > ------+------+-------------+--------------+------------------+------------ > -(0 rows) > + {} | | {NULL} | {NULL} | | f > +(1 row) > > > > also change the way to return an empty array. > > Nice. > I agree the `Datum d` variable was unnecessary. > I also removed the unused includes. > > > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that > > int4hashset can speed distinct aggregate and distinct counts? > > like the following: > > > > explain(analyze, costs off, timing off, buffers) > > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 > > > > explain(analyze, costs off, timing off, buffers) > > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 > > The 100k tables seems to be too small to give any meaningful results, > when trying to measure individual queries: > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k; > Execution Time: 26.790 ms > Execution Time: 30.616 ms > Execution Time: 33.253 ms > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT hashset_agg(i) FROM benchmark_input_100k; > Execution Time: 32.797 ms > Execution Time: 27.605 ms > Execution Time: 26.228 ms > > If we instead try the 10M tables, it looks like array_agg(DISTINCT ...) > is actually faster for the `i` column where all input integers are unique: > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT i) FROM benchmark_input_10M; > Execution Time: 799.017 ms > Execution Time: 796.008 ms > Execution Time: 799.121 ms > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT hashset_agg(i) FROM benchmark_input_10M; > Execution Time: 1204.873 ms > Execution Time: 1221.822 ms > Execution Time: 1216.340 ms > > For random integers, hashset is a win though: > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M; > Execution Time: 1874.722 ms > Execution Time: 1878.760 ms > Execution Time: 1861.640 ms > > EXPLAIN(analyze, costs off, timing off, buffers) > SELECT hashset_agg(rnd) FROM benchmark_input_10M; > Execution Time: 1253.709 ms > Execution Time: 1222.651 ms > Execution Time: 1237.849 ms > > > explain(costs off,timing off, analyze,buffers) > > select count(distinct rnd) from benchmark_input_100k \watch c=3 > > > > explain(costs off,timing off, analyze,buffers) > > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM > > benchmark_input_100k) sub(x) \watch c=3 > > I tried these with 10M: > > EXPLAIN(costs off,timing off, analyze,buffers) > SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M; > Execution Time: 1733.320 ms > Execution Time: 1725.214 ms > Execution Time: 1716.636 ms > > EXPLAIN(costs off,timing off, analyze,buffers) > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x); > Execution Time: 1249.612 ms > Execution Time: 1240.558 ms > Execution Time: 1252.103 ms > > Not sure what I think of the current benchmark suite. > > I think it would be better to only include some realistic examples from > real-life, such as the graph query which was the reason I personally started > working on this. Otherwise there is a risk we optimise for some hypothetical > scenario that is not relevant in practise. > > Would be good with more examples of typical work loads for when the hashset > type would be useful. > > /Joel > Did you try running `make installcheck` after your change? First I use make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config I found out it uses another active cluster. So I killed another active cluster. later i found another database port so I took me sometime to found out I need use following: make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config PGPORT=5421 Anyway, this time, I added another macro,which seems to simplify the code. #define SET_DATA_PTR(a) \ (((char *) (a->data)) + CEIL_DIV(a->capacity, 8)) it passed all the tests on my local machine. I should have only made a patch, but when I was committed, I forgot to mention one file, so I needed 2 commits. > Not sure what I think of the current benchmark suite. your result is so different from mine. I use the default config. I see a big difference. yech, I agree, the performance test should be more careful.
Attachment
pgsql-hackers by date: