Re: Questions regarding distinct operation implementation - Mailing list pgsql-hackers

From: David Rowley
Subject: Re: Questions regarding distinct operation implementation
Msg-id: CAApHDvrz6ebRcVNwGRV8+t6Yyr1YUxW5ZiFHbuFn4qeB1sOnzg@mail.gmail.com
In response to: Re: Questions regarding distinct operation implementation (Ankit Kumar Pandey <itsankitkp@gmail.com>)
List: pgsql-hackers
On Mon, 5 Dec 2022 at 02:34, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> Interesting problem, Hashtables created in normal aggregates (AGG_HASHED
> mode) may provide some reference as they have hashagg_spill_tuple but I
> am not sure of any prior implementation of hashtable with counter and
> spill.

I'm unsure if there's much guidance that can be gleaned from studying
nodeAgg.c. IIRC, that works by deciding up-front how many partitions
we're going to split the hash key space into and then writing out
tuples to files based on "hashkey MOD number-of-partitions".  At the
end of that, you can just aggregate the tuples one partition at a
time, since all tuples belonging to the same group land in the same
file/partition.
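
For what it's worth, that partitioning rule boils down to something
like the sketch below (spill_partition() is a made-up name just to
illustrate the idea; it's not nodeAgg.c's actual code):

    #include <stdint.h>

    /* Every tuple of a given group has the same hash value, so it
     * always lands in the same spill partition/file. */
    static inline int
    spill_partition(uint32_t hashkey, int npartitions)
    {
        return (int) (hashkey % (uint32_t) npartitions);
    }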

The reason this does not seem useful to your case is that you need to
be able to quickly look up a given Datum or set of Datums to check
whether they are unique or not. For that, you'd need to reload the
hash table every time your lookup lands on a different partition of
the hashkey space. I fail to see how that could ever be fast unless
there happened to be only 1 partition.  To make that worse, when a
tuple goes out of the frame and the counter that's tracking how many
times the Datum(s) appeared reaches 0, you need to write the entire
file out again minus that tuple.  Let's say your window function is
on a column which is distinct, or *very* close to it, and the window
frame moves forward 1 tuple per input tuple. If each subsequent Datum
hashes to a different partition, then you're going to need to load
the file for that hash key space to check if that Datum has already
been seen, and then you're going to have to evict that tuple from the
file as it moves out of frame, which means reading and writing that
entire file per input tuple consumed.  That'll perform very poorly!
It's possible that you could speed it up a bit with some lossy hash
table that sits atop of this and can only tell you if the given key
definitely does *not* exist.  You'd then be able to just write that
tuple out to the partition without having to read or write the file
again.  It's going to slow down to a crawl when the lossy table
contains too many false positives, though.
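
To be concrete about what I mean by a lossy table with counters,
here's a minimal standalone sketch (all names invented; I'm not
proposing this as actual code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* counts[hashkey MOD nbuckets] tracks how many in-frame tuples
     * hash into that bucket.  A zero count proves the key was never
     * stored; a non-zero count may be a false positive caused by a
     * colliding key. */
    typedef struct
    {
        uint32_t   *counts;
        uint32_t    nbuckets;
    } LossyTable;

    static LossyTable *
    lossy_create(uint32_t nbuckets)
    {
        LossyTable *lt = malloc(sizeof(LossyTable));

        lt->counts = calloc(nbuckets, sizeof(uint32_t));
        lt->nbuckets = nbuckets;
        return lt;
    }

    /* tuple enters the window frame */
    static void
    lossy_add(LossyTable *lt, uint32_t hashkey)
    {
        lt->counts[hashkey % lt->nbuckets]++;
    }

    /* tuple is evicted from the window frame */
    static void
    lossy_remove(LossyTable *lt, uint32_t hashkey)
    {
        lt->counts[hashkey % lt->nbuckets]--;
    }

    /* a "true" answer is authoritative; "false" only means "maybe" */
    static bool
    lossy_definitely_absent(const LossyTable *lt, uint32_t hashkey)
    {
        return lt->counts[hashkey % lt->nbuckets] == 0;
    }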

> Major concern is, if we go through tuplesort route (without order
> by case), would we get handicapped in future if we want order by or more
> features?

Yeah, deciding that before you go down one of the paths is going to
be important. I imagine the reason you've not found another database
that supports DISTINCT window functions in a window with an ORDER BY
clause is that it's very hard to implement in a way that performs
well in all cases.

Maybe another way to go about it that will give you less lock-in if we
decide to make ORDER BY work later would be to design some new
tuple-store-like data structure that can be defined with a lookup key
so you could ask it if a given key is stored and it would return the
answer quickly without having to trawl through all stored tuples. It
would also need to support the same positional lookups as tuplestore
does today so that all evicting-tuples-from-the-window-frame stuff
works as it does today. If you made something like that, then the
changes required in nodeWindowAgg.c would be significantly reduced.
You'd also just have 1 work_mem limit to abide by instead of having to
consider sharing that between a tuplestore and a hashtable/tuplesort.
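
Concretely, the sort of interface I'm imagining is something like the
below (every name here is hypothetical, and TupleSlot is just a
stand-in for TupleTableSlot to keep the sketch self-contained):

    #include <stdbool.h>

    typedef struct KeyedTuplestore KeyedTuplestore; /* opaque */
    typedef struct TupleSlot TupleSlot;             /* placeholder */

    /* One work_mem budget covers both the stored tuples and the
     * lookup index, so there's a single limit to abide by. */
    extern KeyedTuplestore *keyedtuplestore_begin(int nkeycols,
                                                  long work_mem_kb);

    /* tuplestore-like positional interface, so the existing
     * evicting-tuples-from-the-window-frame logic keeps working */
    extern void keyedtuplestore_puttupleslot(KeyedTuplestore *kts,
                                             TupleSlot *slot);
    extern bool keyedtuplestore_gettupleslot(KeyedTuplestore *kts,
                                             long pos,
                                             TupleSlot *slot);

    /* keyed interface: a quick negative answer without trawling
     * through all stored tuples */
    extern bool keyedtuplestore_key_definitely_does_not_exist(
                                             KeyedTuplestore *kts,
                                             const TupleSlot *slot);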

Maybe as step 1, you could invent keyedtuplestore.c and reuse
tuplestore's functions but layer on the lossy hashtable idea that I
mentioned above. That'd have to be more than just a Bloom filter, as
you need a payload of the count of tuples matching the given hashkey
MOD nbuckets.  If you then had a function like
keyedtuplestore_key_definitely_does_not_exist() (can't think of a
better name right now) then you could just look up the lossy table,
and if there are 0 tuples at that lossy bucket, then you can
keyedtuplestore_puttupleslot() from nodeWindowAgg.c.
keyedtuplestore_key_definitely_does_not_exist() would have to work
much harder if there were >0 tuples with the same lossy hashkey:
you'd need to trawl through the tuples and check each one.  Perhaps
that could be tuned a bit so that if you get too many collisions, the
lossy table could be rehashed to a larger size. It's going to fall
flat on its face, performance-wise, when the hash table can't be made
larger due to work_mem constraints.
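
From nodeWindowAgg.c's side, the check could then look roughly like
this (still all hypothetical; keyedtuplestore_scan_bucket_for_match()
is an invented helper for the trawl-through-and-compare part):

    /* Returns true if an equal key is already stored for the current
     * frame.  A negative answer from the lossy index is authoritative
     * and cheap; a non-empty lossy bucket may be a false positive, so
     * we must compare the colliding tuples one by one. */
    static bool
    distinct_key_seen(KeyedTuplestore *kts, TupleSlot *slot)
    {
        if (keyedtuplestore_key_definitely_does_not_exist(kts, slot))
            return false;

        return keyedtuplestore_scan_bucket_for_match(kts, slot);
    }

and on seeing a new distinct value:

    if (!distinct_key_seen(kts, slot))
        keyedtuplestore_puttupleslot(kts, slot);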

Anyway, that's a lot of only partially thought-through ideas above.
If you're working on a patch like this, you should expect to have to
rewrite it a dozen or two times as new ideas arrive. If you're good
at using the fact that the new patch is better than the old one as
motivation to continue, then you've got an attitude that is
PostgreSQL-community-proof :-)  (Thankfully) we're often not good at
"let's just commit it now and make it better later".

David


