Thread: Functionally dependent columns in SELECT DISTINCT
Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => i)
    FROM generate_series(1, 1000000) s(i);
INSERT INTO parts(item_id, part_id)
    SELECT items.id, 2 * items.id + delta
    FROM items, (VALUES (0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;

-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:

-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4ms:
--   parallel index scan on items_create_time_idx
--     -> nested loop index scan on parts_item_id_idx
--     -> incremental sort -> gather merge -> unique -> limit

-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400ms:
--   parallel seq scan on parts
--     -> parallel hash join on item_id via seq scan on items
--     -> sort -> group -> gather merge -> group -> sort -> limit

These timings are from Postgres 14.5 on a Linux i7-1165G7.
With Postgres 16.3 on an Apple M3 Pro, the shape is the same: the
GROUP BY query takes about 300ms, and the SELECT DISTINCT is faster
still, at 0.07ms. (It declines to parallelize, which seems to help.)

I want to use the faster approach, and it has worked without issue so
far. But that extra column in the SELECT list is a bit inconvenient.
My questions are:

- Do I understand right that these kinds of queries are equivalent?
- If so, does the SQL standard permit Postgres to recognize functional
  dependency in this case, so that users may omit the order column from
  the `SELECT DISTINCT` list? (I don't have a copy of the standard to
  check myself.)
- Might future versions of Postgres allow this?

Thanks!
~Willow
Willow Chargin wrote on 13.09.2024 at 07:20:
> Hello! Postgres lets us omit columns from a GROUP BY clause if they are
> functionally dependent on a grouped key, which is a nice quality-of-life
> feature. I'm wondering if a similar relaxation could be permitted for
> the SELECT DISTINCT list?
>
> I have a query where I want to find the most recent few items from a
> table that match some complex condition, where the condition involves
> joining other tables. Here's an example, with two approaches:

What about using DISTINCT ON ()?

SELECT DISTINCT ON (items.id) items.*
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id, items.create_time DESC
LIMIT 5;

This gives me this plan: https://explain.depesz.com/s/QHr6
on 16.2 (Windows, i7-1260P)
On Thu, Sep 12, 2024 at 11:13 PM <shammat@gmx.net> wrote:
>
> What about using DISTINCT ON () ?
>
> SELECT DISTINCT ON (items.id) items.*
> FROM items
> JOIN parts ON items.id = parts.item_id
> WHERE part_id % 3 = 0
> ORDER BY items.id, items.create_time DESC
> LIMIT 5;
>
> This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2
> (Windows, i7-1260P)

Ordering by items.id changes the answer, though. In the example I gave,
items.id and items.create_time happened to be in the same order, but
that needn't hold. In reality I really do want the ID columns of the
*most recent* items. You can see the difference if you build the test
dataset a bit differently:

INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => random() * 1e6)
    FROM generate_series(1, 1000000) s(i);

We want the returned create_times to all be recent, and the IDs should
now look roughly random.
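To spell out the difference (a sketch against the randomized dataset
above; untested, so timings aside):

```sql
-- DISTINCT ON (items.id) must sort by items.id first, so LIMIT 5 keeps
-- the matching items with the smallest IDs, regardless of recency:
SELECT DISTINCT ON (items.id) items.*
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id, items.create_time DESC
LIMIT 5;

-- Whereas the intended result orders by recency before limiting:
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
```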
On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com> wrote:
> In reality I really do want the ID columns of the
> *most recent* items.
Use a window function to rank them and pull out rank=1, or use a lateral subquery to surgically (fetch first 1) retrieve the first row when sorted by recency descending.
David J.
Thanks both for your suggestions so far.

On Fri, Sep 13, 2024 at 8:43 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com> wrote:
>>
>> In reality I really do want the ID columns of the
>> *most recent* items.
>
> Use a window function to rank them and pull out rank=1

Hmm, like this? noting that it's rank<=5, not rank=1:

-- 1. Rank all item-part combinations, densely, since an item
--    may have multiple parts.
-- 2. Limit by rank, still retaining multiple copies of each item.
-- 3. De-duplicate IDs.
SELECT DISTINCT id FROM (
    SELECT id, dense_rank FROM (
        SELECT items.id, dense_rank() OVER (ORDER BY create_time DESC)
        FROM items JOIN parts ON items.id = parts.item_id
        WHERE part_id % 3 = 0
    ) q
    WHERE dense_rank <= 5
) q

I've done this before, but my experience is that it's usually far
slower, because the rank is computed eagerly even for rows that don't
match the rank bound. And indeed here it takes 20% longer than even the
slower GROUP BY from before: https://explain.depesz.com/s/mQIi

> or use a lateral subquery to surgically (fetch first 1) retrieve the
> first row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not top-1.
>> or use a lateral subquery to surgically (fetch first 1) retrieve the
>> first row when sorted by recency descending.
>
> I'm not sure that I see how to apply this when I need top-k, not top-1.
Fetch first k
It's just a modern limit clause.
David J.
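[For the archives, one possible reading of the lateral suggestion — a
sketch, untested, and the semi-join shape plus the `has_part` alias are
my interpretation rather than anything stated upthread:]

```sql
-- Per item, the LATERAL subquery fetches at most one matching part, so
-- each item appears once and no DISTINCT is needed; the outer query
-- then orders by recency and takes the top k with FETCH FIRST.
SELECT items.id
FROM items
JOIN LATERAL (
    SELECT 1
    FROM parts
    WHERE parts.item_id = items.id
      AND part_id % 3 = 0
    FETCH FIRST 1 ROW ONLY
) has_part ON true
ORDER BY items.create_time DESC
FETCH FIRST 5 ROWS ONLY;
```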