[PATCH] distinct aggregates within a window function WIP - Mailing list pgsql-hackers
From | Krasiyan Andreev |
---|---|
Subject | [PATCH] distinct aggregates within a window function WIP |
Date | |
Msg-id | CAN1Pwonf4waD+PWkEFK8ANLua8fPjZ4DmV+hixO62+LiR8gwaA@mail.gmail.com Whole thread Raw |
Responses |
Re: [PATCH] distinct aggregates within a window function WIP
|
List | pgsql-hackers |
Hi hackers,
It's a WIP, because it doesn't have tests yet (I will add them later) and also, it works for a int, float, and numeric types,
but probably distinct check can be rewritten for possible performance improvement,
I want to propose to you an old patch for Postgres 11, off-site developed by Oliver Ford,
but I have permission from him to publish it and to continue it's development,
that allow distinct aggregates, like select sum(distinct nums) within a window function.
I have rebased it for current git master branch and have made necessary changes to it to work with Postgres 13devel.
It's a WIP, because it doesn't have tests yet (I will add them later) and also, it works for a int, float, and numeric types,
but probably distinct check can be rewritten for possible performance improvement,
with storing the distinct elements in a hash table which should give a performance improvement.
If you find the implementation of patch acceptable from committers perspective,
I will answer to all yours design and review notes and will try to go ahead with it,
also, I will add this patch to the March commit fest.
I will answer to all yours design and review notes and will try to go ahead with it,
also, I will add this patch to the March commit fest.
For example usage of a patch, if you have time series data, with current Postgres you will get an error:
postgres=# CREATE TABLE t_demo AS
postgres-# SELECT ordinality, day, date_part('week', day) AS week
postgres-# FROM generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
postgres-# WITH ORDINALITY AS day;
SELECT 14
postgres=# SELECT * FROM t_demo;
ordinality | day | week
------------+------------------------+------
1 | 2020-01-02 00:00:00+02 | 1
2 | 2020-01-03 00:00:00+02 | 1
3 | 2020-01-04 00:00:00+02 | 1
4 | 2020-01-05 00:00:00+02 | 1
5 | 2020-01-06 00:00:00+02 | 2
6 | 2020-01-07 00:00:00+02 | 2
7 | 2020-01-08 00:00:00+02 | 2
8 | 2020-01-09 00:00:00+02 | 2
9 | 2020-01-10 00:00:00+02 | 2
10 | 2020-01-11 00:00:00+02 | 2
11 | 2020-01-12 00:00:00+02 | 2
12 | 2020-01-13 00:00:00+02 | 3
13 | 2020-01-14 00:00:00+02 | 3
14 | 2020-01-15 00:00:00+02 | 3
(14 rows)
postgres=# SELECT *,
postgres-# array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM t_demo;
ERROR: DISTINCT is not implemented for window functions
LINE 2: array_agg(DISTINCT week) OVER (ORDER BY day ROWS
^
postgres-# SELECT ordinality, day, date_part('week', day) AS week
postgres-# FROM generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
postgres-# WITH ORDINALITY AS day;
SELECT 14
postgres=# SELECT * FROM t_demo;
ordinality | day | week
------------+------------------------+------
1 | 2020-01-02 00:00:00+02 | 1
2 | 2020-01-03 00:00:00+02 | 1
3 | 2020-01-04 00:00:00+02 | 1
4 | 2020-01-05 00:00:00+02 | 1
5 | 2020-01-06 00:00:00+02 | 2
6 | 2020-01-07 00:00:00+02 | 2
7 | 2020-01-08 00:00:00+02 | 2
8 | 2020-01-09 00:00:00+02 | 2
9 | 2020-01-10 00:00:00+02 | 2
10 | 2020-01-11 00:00:00+02 | 2
11 | 2020-01-12 00:00:00+02 | 2
12 | 2020-01-13 00:00:00+02 | 3
13 | 2020-01-14 00:00:00+02 | 3
14 | 2020-01-15 00:00:00+02 | 3
(14 rows)
postgres=# SELECT *,
postgres-# array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM t_demo;
ERROR: DISTINCT is not implemented for window functions
LINE 2: array_agg(DISTINCT week) OVER (ORDER BY day ROWS
^
So you will need to write something like this:
postgres=# SELECT *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b
postgres-# FROM
postgres-# (
postgres(# SELECT *,
postgres(# array_agg(week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
postgres(# FROM t_demo
postgres(# ) AS a;
ordinality | day | week | x | b
------------+------------------------+------+-------------+-------
1 | 2020-01-02 00:00:00+02 | 1 | {1,1,1} | {1}
2 | 2020-01-03 00:00:00+02 | 1 | {1,1,1,1} | {1}
3 | 2020-01-04 00:00:00+02 | 1 | {1,1,1,1,2} | {1,2}
4 | 2020-01-05 00:00:00+02 | 1 | {1,1,1,2,2} | {1,2}
5 | 2020-01-06 00:00:00+02 | 2 | {1,1,2,2,2} | {1,2}
6 | 2020-01-07 00:00:00+02 | 2 | {1,2,2,2,2} | {1,2}
7 | 2020-01-08 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
8 | 2020-01-09 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
9 | 2020-01-10 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
10 | 2020-01-11 00:00:00+02 | 2 | {2,2,2,2,3} | {2,3}
11 | 2020-01-12 00:00:00+02 | 2 | {2,2,2,3,3} | {2,3}
12 | 2020-01-13 00:00:00+02 | 3 | {2,2,3,3,3} | {2,3}
13 | 2020-01-14 00:00:00+02 | 3 | {2,3,3,3} | {2,3}
14 | 2020-01-15 00:00:00+02 | 3 | {3,3,3} | {3}
(14 rows)
postgres-# FROM
postgres-# (
postgres(# SELECT *,
postgres(# array_agg(week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
postgres(# FROM t_demo
postgres(# ) AS a;
ordinality | day | week | x | b
------------+------------------------+------+-------------+-------
1 | 2020-01-02 00:00:00+02 | 1 | {1,1,1} | {1}
2 | 2020-01-03 00:00:00+02 | 1 | {1,1,1,1} | {1}
3 | 2020-01-04 00:00:00+02 | 1 | {1,1,1,1,2} | {1,2}
4 | 2020-01-05 00:00:00+02 | 1 | {1,1,1,2,2} | {1,2}
5 | 2020-01-06 00:00:00+02 | 2 | {1,1,2,2,2} | {1,2}
6 | 2020-01-07 00:00:00+02 | 2 | {1,2,2,2,2} | {1,2}
7 | 2020-01-08 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
8 | 2020-01-09 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
9 | 2020-01-10 00:00:00+02 | 2 | {2,2,2,2,2} | {2}
10 | 2020-01-11 00:00:00+02 | 2 | {2,2,2,2,3} | {2,3}
11 | 2020-01-12 00:00:00+02 | 2 | {2,2,2,3,3} | {2,3}
12 | 2020-01-13 00:00:00+02 | 3 | {2,2,3,3,3} | {2,3}
13 | 2020-01-14 00:00:00+02 | 3 | {2,3,3,3} | {2,3}
14 | 2020-01-15 00:00:00+02 | 3 | {3,3,3} | {3}
(14 rows)
With attached version, you will get the desired results:
postgres=# SELECT *,
postgres-# array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM t_demo;
ordinality | day | week | array_agg
------------+------------------------+------+-----------
1 | 2020-01-02 00:00:00+02 | 1 | {1}
2 | 2020-01-03 00:00:00+02 | 1 | {1}
3 | 2020-01-04 00:00:00+02 | 1 | {1,2}
4 | 2020-01-05 00:00:00+02 | 1 | {1,2}
5 | 2020-01-06 00:00:00+02 | 2 | {1,2}
6 | 2020-01-07 00:00:00+02 | 2 | {1,2}
7 | 2020-01-08 00:00:00+02 | 2 | {2}
8 | 2020-01-09 00:00:00+02 | 2 | {2}
9 | 2020-01-10 00:00:00+02 | 2 | {2}
10 | 2020-01-11 00:00:00+02 | 2 | {2,3}
11 | 2020-01-12 00:00:00+02 | 2 | {2,3}
12 | 2020-01-13 00:00:00+02 | 3 | {2,3}
13 | 2020-01-14 00:00:00+02 | 3 | {2,3}
14 | 2020-01-15 00:00:00+02 | 3 | {3}
(14 rows)
postgres-# array_agg(DISTINCT week) OVER (ORDER BY day ROWS
postgres(# BETWEEN 2 PRECEDING AND 2 FOLLOWING)
postgres-# FROM t_demo;
ordinality | day | week | array_agg
------------+------------------------+------+-----------
1 | 2020-01-02 00:00:00+02 | 1 | {1}
2 | 2020-01-03 00:00:00+02 | 1 | {1}
3 | 2020-01-04 00:00:00+02 | 1 | {1,2}
4 | 2020-01-05 00:00:00+02 | 1 | {1,2}
5 | 2020-01-06 00:00:00+02 | 2 | {1,2}
6 | 2020-01-07 00:00:00+02 | 2 | {1,2}
7 | 2020-01-08 00:00:00+02 | 2 | {2}
8 | 2020-01-09 00:00:00+02 | 2 | {2}
9 | 2020-01-10 00:00:00+02 | 2 | {2}
10 | 2020-01-11 00:00:00+02 | 2 | {2,3}
11 | 2020-01-12 00:00:00+02 | 2 | {2,3}
12 | 2020-01-13 00:00:00+02 | 3 | {2,3}
13 | 2020-01-14 00:00:00+02 | 3 | {2,3}
14 | 2020-01-15 00:00:00+02 | 3 | {3}
(14 rows)
Attachment
pgsql-hackers by date: