Thread: Functionally dependent columns in SELECT DISTINCT
Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => i)
    FROM generate_series(1, 1000000) s(i);
INSERT INTO parts(item_id, part_id)
    SELECT items.id, 2 * items.id + delta
    FROM items, (VALUES (0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;

-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:

-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4ms:
--   parallel index scan on items_create_time_idx
--     -> nested loop index scan on parts_item_id_idx
--     -> incremental sort -> gather merge -> unique -> limit

-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400ms:
--   parallel seq scan on parts
--     -> parallel hash join on item_id via seq scan on items
--     -> sort -> group -> gather merge -> group -> sort -> limit

These timings are from Postgres 14.5 on a Linux i7-1165G7.
With Postgres 16.3 on an Apple M3 Pro, the shape is the same: the
GROUP BY query takes about 300ms, and the SELECT DISTINCT is faster
still, at 0.07ms. (It declines to parallelize, which seems to help.)

I want to use the faster approach, and it has worked without issue so
far. But that extra column in the SELECT list is a bit inconvenient.
My questions are:

- Do I understand right that these kinds of queries are equivalent?
- If so, does the SQL standard permit Postgres to recognize functional
  dependency in this case, so that users may omit the order column from
  the `SELECT DISTINCT` list? (I don't have a copy of the standard to
  check myself.)
- Might future versions of Postgres allow this?

Thanks!
~Willow
Willow Chargin wrote on 13.09.2024 at 07:20:
> Hello! Postgres lets us omit columns from a GROUP BY clause if they are
> functionally dependent on a grouped key, which is a nice quality-of-life
> feature. I'm wondering if a similar relaxation could be permitted for
> the SELECT DISTINCT list?
>
> I have a query where I want to find the most recent few items from a
> table that match some complex condition, where the condition involves
> joining other tables. Here's an example, with two approaches:

What about using DISTINCT ON ()?

SELECT DISTINCT ON (items.id) items.*
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id, items.create_time DESC
LIMIT 5;

This gives me this plan: https://explain.depesz.com/s/QHr6
on 16.2 (Windows, i7-1260P)
On Thu, Sep 12, 2024 at 11:13 PM <shammat@gmx.net> wrote:
>
> What about using DISTINCT ON () ?
>
> SELECT DISTINCT ON (items.id) items.*
> FROM items
> JOIN parts ON items.id = parts.item_id
> WHERE part_id % 3 = 0
> ORDER BY items.id, items.create_time DESC
> LIMIT 5;
>
> This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2
> (Windows, i7-1260P)

Ordering by items.id changes the answer, though. In the example I gave,
items.id and items.create_time happened to be in the same order, but
that needn't hold. In reality I really do want the ID columns of the
*most recent* items. You can see the difference if you build the test
dataset a bit differently:

INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => random() * 1e6)
    FROM generate_series(1, 1000000) s(i);

We want the returned create_times to all be recent, and the IDs should
now look roughly random.
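To spell out the difference (a sketch against the randomized dataset
above; untested, so timings aside):

```sql
-- DISTINCT ON (items.id) must sort by items.id first, so LIMIT 5 keeps
-- the matching items with the smallest IDs, regardless of recency:
SELECT DISTINCT ON (items.id) items.*
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.id, items.create_time DESC
LIMIT 5;

-- Whereas the intended result orders by recency before limiting:
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
```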
On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com> wrote:
> In reality I really do want the ID columns of the
> *most recent* items.
Use a window function to rank them and pull out rank=1, or use a lateral subquery to surgically (fetch first 1) retrieve the first row when sorted by recency descending.
David J.
Thanks both for your suggestions so far.

On Fri, Sep 13, 2024 at 8:43 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Friday, September 13, 2024, Willow Chargin <postgresql@wchargin.com> wrote:
>>
>> In reality I really do want the ID columns of the
>> *most recent* items.
>
> Use a window function to rank them and pull out rank=1

Hmm, like this? noting that it's rank<=5, not rank=1:

-- 1. Rank all item-part combinations, densely, since an item
--    may have multiple parts.
-- 2. Limit by rank, still retaining multiple copies of each item.
-- 3. De-duplicate IDs.
SELECT DISTINCT id FROM (
    SELECT id, dense_rank FROM (
        SELECT items.id, dense_rank() OVER (ORDER BY create_time DESC)
        FROM items JOIN parts ON items.id = parts.item_id
        WHERE part_id % 3 = 0
    ) q
    WHERE dense_rank <= 5
) q

I've done this before, but my experience is that it's usually far
slower, because the rank is computed eagerly even for rows that don't
match the rank bound. And indeed here it takes 20% longer than even the
slower GROUP BY from before: https://explain.depesz.com/s/mQIi

> or use a lateral subquery to surgically (fetch first 1) retrieve the
> first row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not top-1.
>> or use a lateral subquery to surgically (fetch first 1) retrieve the
>> first row when sorted by recency descending.
>
> I'm not sure that I see how to apply this when I need top-k, not top-1.
Fetch first k
It's just a modern limit clause.
David J.
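[For the archives, one possible reading of the lateral suggestion — a
sketch, untested, and the semi-join shape plus the `has_part` alias are
my interpretation rather than anything stated upthread:]

```sql
-- Per item, the LATERAL subquery fetches at most one matching part, so
-- each item appears once and no DISTINCT is needed; the outer query
-- then orders by recency and takes the top k with FETCH FIRST.
SELECT items.id
FROM items
JOIN LATERAL (
    SELECT 1
    FROM parts
    WHERE parts.item_id = items.id
      AND part_id % 3 = 0
    FETCH FIRST 1 ROW ONLY
) has_part ON true
ORDER BY items.create_time DESC
FETCH FIRST 5 ROWS ONLY;
```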