[PATCH] distinct aggregates within a window function WIP - Mailing list pgsql-hackers

From Krasiyan Andreev
Subject [PATCH] distinct aggregates within a window function WIP
Date
Msg-id CAN1Pwonf4waD+PWkEFK8ANLua8fPjZ4DmV+hixO62+LiR8gwaA@mail.gmail.com
Whole thread Raw
Responses Re: [PATCH] distinct aggregates within a window function WIP  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi hackers,

I want to propose to you an old patch for Postgres 11, off-site developed by Oliver Ford,
but I have permission from him to publish it and to continue it's development,
that allow distinct aggregates, like select sum(distinct nums) within a window function.

I have rebased it for current git master branch and have made necessary changes to it to work with Postgres 13devel.

It's a WIP, because it doesn't have tests yet (I will add them later) and also, it works for a int, float, and numeric types,
but probably distinct check can be rewritten for possible performance improvement,
with storing the distinct elements in a hash table which should give a performance improvement.

If you find the implementation of patch acceptable from committers perspective,
I will answer to all yours design and review notes and will try to go ahead with it,
also, I will add this patch to the March commit fest.

For example usage of a patch, if you have time series data, with current Postgres you will get an error:

postgres=# CREATE TABLE t_demo AS
postgres-#     SELECT   ordinality, day, date_part('week', day) AS week
postgres-#     FROM    generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
postgres-#             WITH ORDINALITY AS day;
SELECT 14
postgres=# SELECT * FROM t_demo;
 ordinality |          day           | week
------------+------------------------+------
          1 | 2020-01-02 00:00:00+02 |    1
          2 | 2020-01-03 00:00:00+02 |    1
          3 | 2020-01-04 00:00:00+02 |    1
          4 | 2020-01-05 00:00:00+02 |    1
          5 | 2020-01-06 00:00:00+02 |    2
          6 | 2020-01-07 00:00:00+02 |    2
          7 | 2020-01-08 00:00:00+02 |    2
          8 | 2020-01-09 00:00:00+02 |    2
          9 | 2020-01-10 00:00:00+02 |    2
         10 | 2020-01-11 00:00:00+02 |    2
         11 | 2020-01-12 00:00:00+02 |    2
         12 | 2020-01-13 00:00:00+02 |    3
         13 | 2020-01-14 00:00:00+02 |    3
         14 | 2020-01-15 00:00:00+02 |    3
(14 rows)

postgres=# SELECT   *,
postgres-#         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(#                                 BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM    t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
                ^

So you will need to write something like this:

postgres=# SELECT  *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b
postgres-# FROM
postgres-# (
postgres(#         SELECT  *,
postgres(#                 array_agg(week) OVER (ORDER BY day ROWS
postgres(#                         BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
postgres(#         FROM    t_demo
postgres(# ) AS a;
 ordinality |          day           | week |      x      |   b  
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+02 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+02 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+02 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+02 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+02 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+02 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+02 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+02 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+02 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+02 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+02 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+02 |    3 | {3,3,3}     | {3}
(14 rows)

With attached version, you will get the desired results:

postgres=# SELECT   *,
postgres-#         array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(#                                 BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM    t_demo;
 ordinality |          day           | week | array_agg
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+02 |    1 | {1}
          2 | 2020-01-03 00:00:00+02 |    1 | {1}
          3 | 2020-01-04 00:00:00+02 |    1 | {1,2}
          4 | 2020-01-05 00:00:00+02 |    1 | {1,2}
          5 | 2020-01-06 00:00:00+02 |    2 | {1,2}
          6 | 2020-01-07 00:00:00+02 |    2 | {1,2}
          7 | 2020-01-08 00:00:00+02 |    2 | {2}
          8 | 2020-01-09 00:00:00+02 |    2 | {2}
          9 | 2020-01-10 00:00:00+02 |    2 | {2}
         10 | 2020-01-11 00:00:00+02 |    2 | {2,3}
         11 | 2020-01-12 00:00:00+02 |    2 | {2,3}
         12 | 2020-01-13 00:00:00+02 |    3 | {2,3}
         13 | 2020-01-14 00:00:00+02 |    3 | {2,3}
         14 | 2020-01-15 00:00:00+02 |    3 | {3}
(14 rows)





Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Question regarding heap_multi_insert documentation
Next
From: Sergei Kornilov
Date:
Subject: Re: [HACKERS] Block level parallel vacuum