Thread: proposal: window function - change_number
Hi
I tried to solve the following task:
I have a table
start, reason, km
=============
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20
2014-01-01 10:00:00, commerc, 20
2014-01-01 11:00:00, private, 8
and I would like to reduce these rows to
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20 + 20 = 40
2014-01-01 11:00:00, private, 8
It is relatively hard to do this now with SQL only. But we could simplify this task with a window function that returns the number of changes in some column. Then this task can be solved by
select min(start), min(reason), sum(km)
from (select start, reason, km,
             change_number(reason) over (order by start) as change_number
      from data) s
group by change_number;
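Since change_number() does not exist yet, here is a sketch of the intended semantics in plain Python, using the sample rows from above; the helper name change_numbers() is invented for illustration. The counter increments every time the watched column changes, so each run of equal values gets its own group id, and the GROUP BY then collapses each run:

```python
rows = [
    ("2014-01-01 08:00:00", "private", 10),
    ("2014-01-01 09:00:00", "commerc", 20),
    ("2014-01-01 10:00:00", "commerc", 20),
    ("2014-01-01 11:00:00", "private", 8),
]

def change_numbers(values):
    """1-based group ids; the id bumps every time the value changes."""
    ids, prev, n = [], object(), 0
    for v in values:
        if v != prev:
            n, prev = n + 1, v
        ids.append(n)
    return ids

groups = change_numbers([reason for _, reason, _ in rows])

# GROUP BY change_number: min(start), min(reason), sum(km) per group
agg = {}
for g, (start, reason, km) in zip(groups, rows):
    s, r, k = agg.get(g, (start, reason, 0))
    agg[g] = (min(s, start), min(r, reason), k + km)
result = [agg[g] for g in sorted(agg)]
```

Running this reduces the four rows to the three desired groups.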
Do you think it makes sense?
Regards
Pavel
On Sun, Sep 21, 2014 at 9:27 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi
I tried to solve following task:
I have a table
start, reason, km
=============
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20
2014-01-01 10:00:00, commerc, 20
2014-01-01 11:00:00, private, 8
and I would reduce these rows to
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20 + 20 = 40
2014-01-01 11:00:00, private, 8

It is relatively hard to do this now with SQL only. But we can simplify this
task with a window function that returns the number of changes in some
column. Then this task can be solved by

select min(start), min(reason), sum(km)
from (select start, reason, km, change_number(reason) over (order by start))
group by change_number;
I guess that might be quite useful; otherwise the only way that comes to mind to do this would be something along the lines of:
select *,
       sum(case when reason <> lastreason then 1 else 0 end)
           over (order by start) as chg_num
from (select *,
             lag(reason) over (order by start) as lastreason
      from sometable) sometable;
This way might not be too bad, as I think the outer window will have no need to perform another sort, since the inner window clause has sorted it the right way already. Though something like change_number() would make this a bit prettier. It's almost like rank(), but with a parameter.
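The lag()-plus-windowed-sum pattern above runs unchanged on any engine with window functions; as a self-contained sanity check (not how the thread ran it), it can be replayed on SQLite through Python, with the thread's sample rows loaded into sometable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sometable (start text, reason text, km int);
    insert into sometable values
      ('2014-01-01 08:00:00', 'private', 10),
      ('2014-01-01 09:00:00', 'commerc', 20),
      ('2014-01-01 10:00:00', 'commerc', 20),
      ('2014-01-01 11:00:00', 'private', 8);
""")

# chg_num starts at 0 because the first row's lag(reason) is NULL, which
# makes "reason <> lastreason" NULL and the CASE fall through to 0; the
# group ids are still distinct per run, which is all the GROUP BY needs.
reduced = conn.execute("""
    select min(start), min(reason), sum(km)
    from (select *,
                 sum(case when reason <> lastreason then 1 else 0 end)
                     over (order by start) as chg_num
          from (select *, lag(reason) over (order by start) as lastreason
                from sometable) s) t
    group by chg_num
    order by 1
""").fetchall()
```

The result is the three reduced rows from the original task.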
Regards
David Rowley
Hi Pavel (and others),

Pavel Stehule wrote:
> Hi
> I tried to solve the following task:
>
> I have a table
>
> start, reason, km
> =============
> 2014-01-01 08:00:00, private, 10
> 2014-01-01 09:00:00, commerc, 20
> 2014-01-01 10:00:00, commerc, 20
> 2014-01-01 11:00:00, private, 8
>
> and I would reduce these rows to
>
> 2014-01-01 08:00:00, private, 10
> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
> 2014-01-01 11:00:00, private, 8
>
> It is relatively hard to do this now with SQL only. But we can simplify this
> task with a window function that returns the number of changes in some
> column. Then this task can be solved by
>
> select min(start), min(reason), sum(km)
> from (select start, reason, km, change_number(reason) over (order by
> start))
> group by change_number;

What about

select srk.reason, min(srk.start), sum(srk.km)
from start_reason_km srk
group by srk.reason, (select max(start) from start_reason_km other WHERE
other.start < srk.start and other.reason != srk.reason);

In general, I think window functions are very specific in how the query plan
must look, leaving not much room for the optimizer. On the other hand, if
there happens to be an efficient way to get the results of the table ordered
by "start", then the window function will very likely be much faster than a
join. It would be nice if the optimizer were able to add such stream order
operations.

> Do you think it makes sense?
>
> Regards
>
> Pavel

Regards,

Mart

PS: This is my first post to the mailing list. I am a software developer
interested in performance, making web applications with a different database
server during working hours.
2014-09-21 14:30 GMT+02:00 Mart Kelder <mart@kelder31.nl>:
Hi Pavel (and others),

Pavel Stehule wrote:
> Hi
> I tried to solve the following task:
>
> I have a table
>
> start, reason, km
> =============
> 2014-01-01 08:00:00, private, 10
> 2014-01-01 09:00:00, commerc, 20
> 2014-01-01 10:00:00, commerc, 20
> 2014-01-01 11:00:00, private, 8
>
> and I would reduce these rows to
>
> 2014-01-01 08:00:00, private, 10
> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
> 2014-01-01 11:00:00, private, 8
>
> It is relatively hard to do this now with SQL only. But we can simplify this
> task with a window function that returns the number of changes in some
> column. Then this task can be solved by
>
> select min(start), min(reason), sum(km)
> from (select start, reason, km, change_number(reason) over (order by
> start))
> group by change_number;

What about

select srk.reason, min(srk.start), sum(srk.km)
from start_reason_km srk
group by srk.reason, (select max(start) from start_reason_km other WHERE
other.start < srk.start and other.reason != srk.reason);
This query is a Cartesian product, so for large data it is significantly slower than the window function (which requires only sorts, without joins).
My motivation was a) to implement the described task without a Cartesian product, and b) to introduce some tool for this kind of problem. I have seen requests to reduce a time series more than once, and a window function "change_number" (or maybe "consistent_series_number") could be a good candidate.
In general, I think window functions are very specific in how the query plan
must look, leaving not much room for the optimizer. On the other hand, if
there happens to be an efficient way to get the results of the table ordered
by "start", then the window function will very likely be much faster than a
join. It would be nice if the optimizer were able to add such stream order
operations.
> Do you think it makes sense?
>
> Regards
>
> Pavel
Regards,
Mart
PS: This is my first post to the mailing list. I am a software developer
interested in performance, making web applications with a different database
server during working hours.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi Pavel (and others),

On Sunday, 21 September 2014 15:35:46, you wrote:
> 2014-09-21 14:30 GMT+02:00 Mart Kelder <mart@kelder31.nl>:
> > Hi Pavel (and others),
> >
> > Pavel Stehule wrote:
> > > Hi
> > > I tried to solve the following task:
> > >
> > > I have a table
> > >
> > > start, reason, km
> > > =============
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20
> > > 2014-01-01 10:00:00, commerc, 20
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > and I would reduce these rows to
> > >
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20 + 20 = 40
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > It is relatively hard to do this now with SQL only. But we can simplify
> > > this task with a window function that returns the number of changes in
> > > some column. Then this task can be solved by
> > >
> > > select min(start), min(reason), sum(km)
> > > from (select start, reason, km, change_number(reason) over (order by
> > > start))
> > > group by change_number;
> >
> > What about
> >
> > select srk.reason, min(srk.start), sum(srk.km)
> > from start_reason_km srk
> > group by srk.reason, (select max(start) from start_reason_km other WHERE
> > other.start < srk.start and other.reason != srk.reason);
>
> This query is a Cartesian product, so for large data it is significantly
> slower than the window function (which requires only sorts, without joins)

I think we have the same query plan in mind (with only one scan). As far as I
know, SQL is a language where you define the result you want to get and let
the server find a way to get the data. I think window functions also say how
the server needs to get the information.

The real challenge is of course finding heuristics to remove the additional
join. In this particular case, I can tell how to remove the inner join from
the subquery:

* the where-clause of the self-join contains other.start < srk.start. From
that we can conclude that if the table is (or can be) sorted on "start", we
have seen the data before the subquery is executed
* because we only need an aggregate, we need to store the intermediate "max"
for each "reason", and then add the result to the stream.

Recently, I had a similar problem with a table containing a timestamp, a
state and an object the state belongs to. An object remains in a state until
there is a more recent tuple in the table. I basically needed to query the
previous state for each tuple, but preferably without the additional join.

> My motivation was a) to implement described task without Cartesian product.

Good reason (if you consider the query plan and not the query).

> b) introduce some tool for this kind of problems. I seen more times a
> request .. reduce a time series, and a window function "change_number" (or
> maybe "consistent_series_number") can be good candidate.

I also need to note that there is a lot of difference in complexity between
the possible solutions of this problem. Where a new window function can
probably be implemented very easily, the optimizer changes described above
are complex and not easy to implement.

I also want to note that I am not really against new window functions; I
only want to point out that a more generic solution might also be possible.

Regards,

Mart
2014-09-21 17:00 GMT+02:00 Mart Kelder <mart@kelder31.nl>:
Hi Pavel (and others),
On Sunday, 21 September 2014 15:35:46, you wrote:
> 2014-09-21 14:30 GMT+02:00 Mart Kelder <mart@kelder31.nl>:
> > Hi Pavel (and others),
> >
> > Pavel Stehule wrote:
> > > Hi
> > > I tried to solve the following task:
> > >
> > > I have a table
> > >
> > > start, reason, km
> > > =============
> > >
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20
> > > 2014-01-01 10:00:00, commerc, 20
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > and I would reduce these rows to
> > >
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20 + 20 = 40
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > It is relatively hard to do this now with SQL only. But we can simplify
> > > this task with a window function that returns the number of changes in
> > > some column. Then this task can be solved by
> > >
> > > select min(start), min(reason), sum(km)
> > > from (select start, reason, km, change_number(reason) over (order by
> > > start))
> > > group by change_number;
> >
> > What about
> >
> > select srk.reason, min(srk.start), sum(srk.km)
> > from start_reason_km srk
> > group by srk.reason, (select max(start) from start_reason_km other WHERE
> > other.start < srk.start and other.reason != srk.reason);
>
> This query is a Cartesian product, so for large data it is significantly
> slower than the window function (which requires only sorts, without joins)

I think we have the same query plan in mind (with only one scan). As far as I
know, SQL is a language where you define the result you want to get and let
the server find a way to get the data. I think window functions also say how
the server needs to get the information.
As far as I know, this is not true today: any subselect enforces an individual scan of the source relation, and Postgres has no special support for self-joins.
The real challenge is of course finding heuristics to remove the additional
join. In this particular case, I can tell how to remove the inner join from
the subquery:
* the where-clause of the self-join contains other.start < srk.start. From
that we can conclude that if the table is (or can be) sorted on "start", we
have seen the data before the subquery is executed
* because we only need an aggregate, we need to store the intermediate "max"
for each "reason". And then add the result to the stream.
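A minimal sketch of that streaming idea in Python (all names here are illustrative, not from the thread): one ordered pass keeps the most recent "start" per "reason"; for each row, the group boundary is the latest start among the other reasons, which is exactly what the correlated subquery computed.

```python
rows = [
    ("2014-01-01 08:00:00", "private", 10),
    ("2014-01-01 09:00:00", "commerc", 20),
    ("2014-01-01 10:00:00", "commerc", 20),
    ("2014-01-01 11:00:00", "private", 8),
]

latest = {}   # reason -> most recent start seen so far
agg = {}      # (reason, boundary) -> [min_start, km_sum]
for start, reason, km in rows:            # rows already ordered by start
    # max(start) over earlier rows whose reason differs == group boundary
    boundary = max((s for r, s in latest.items() if r != reason),
                   default=None)
    latest[reason] = start
    entry = agg.setdefault((reason, boundary), [start, 0])
    entry[1] += km

streamed = sorted((s, r, k) for (r, _), (s, k) in agg.items())
```

One pass, one small dictionary per reason, no self-join.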
Recently, I had a similar problem with a table containing a timestamp, a
state and an object the state belongs to. An object remains in a state until
there is a more recent tuple in the table. I basically needed to query the
previous state for each tuple, but preferably without the additional join.
> My motivation was a) to implement described task without Cartesian product.
Good reason (if you consider the query plan and not the query).
yes.
There is probably big room for improvement in more directions. For example, we have an application where the pattern SELECT ... FROM A JOIN (SELECT someagg() FROM A) .. ON ... is often used.
Sometimes these queries are slow due to terribly low estimates. It is one example of many.
Pavel
> b) introduce some tool for this kind of problems. I seen more times a
> request .. reduce a time series, and a window function "change_number" (or
> maybe "consistent_series_number") can be good candidate.
I also need to note that there is a lot of difference in complexity between
the possible solutions of this problem. Where a new window function can
probably be implemented very easily, the optimizer changes described above
are complex and not easy to implement.
I also want to note that I am not really against new window functions; I
only want to point out that a more generic solution might also be possible.
Regards,
Mart
>>>>> "Pavel" == Pavel Stehule <pavel.stehule@gmail.com> writes:

 Pavel> Hi
 Pavel> I tried to solve the following task:
 Pavel> I have a table
 Pavel> start, reason, km
 Pavel> =============
 Pavel> 2014-01-01 08:00:00, private, 10
 Pavel> 2014-01-01 09:00:00, commerc, 20
 Pavel> 2014-01-01 10:00:00, commerc, 20
 Pavel> 2014-01-01 11:00:00, private, 8
 Pavel> and I would reduce these rows to
 Pavel> 2014-01-01 08:00:00, private, 10
 Pavel> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
 Pavel> 2014-01-01 11:00:00, private, 8
 Pavel> It is relatively hard to do it now with SQL only.

Only relatively. My standard solution is something like this:

select start_time, reason, sum(km) as km
  from (select max(label_time) over (order by start) as start_time,
               reason, km
          from (select start, reason, km,
                       case when reason
                                 is distinct from
                                 lag(reason) over (order by start)
                            then start
                        end as label_time
                  from yourtable
               ) s2
       ) s1
 group by start_time, reason
 order by start_time;

(Your change_number idea is essentially equivalent to doing
sum(case when x is distinct from lag(x) over w then 1 end) over w,
except that since window functions can't be nested, that expression
requires a subquery.)

--
Andrew (irc:RhodiumToad)
Hey, sorry if what I say is obvious to you.

If I understood your problem correctly, it is strictly equivalent to this one:

http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-td5775363.html

There is a postgres trick to solve this problem: generate a row number by the
order you want, then a row number by the group; a subtraction of the two row
numbers then gives you a unique id per group.
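Rémi's row-number trick (the classic "gaps and islands" pattern) applied to the thread's sample data, checked here with SQLite through Python; the table name trips is mine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table trips (start text, reason text, km int);
    insert into trips values
      ('2014-01-01 08:00:00', 'private', 10),
      ('2014-01-01 09:00:00', 'commerc', 20),
      ('2014-01-01 10:00:00', 'commerc', 20),
      ('2014-01-01 11:00:00', 'private', 8);
""")

# The difference of the two row numbers is constant within each run of
# equal "reason" values, so (reason, grp) identifies one run.
runs = conn.execute("""
    select min(start), reason, sum(km)
    from (select *,
                 row_number() over (order by start)
               - row_number() over (partition by reason order by start) as grp
          from trips) s
    group by reason, grp
    order by 1
""").fetchall()
```

As Rémi notes, this costs two window functions over the same ordering.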
2014-09-21 17:51 GMT+02:00 Andrew Gierth <andrew@tao11.riddles.org.uk>:
>>>>> "Pavel" == Pavel Stehule <pavel.stehule@gmail.com> writes:
Pavel> Hi
Pavel> I tried to solve following task:
Pavel> I have a table
Pavel> start, reason, km
Pavel> =============
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20
Pavel> 2014-01-01 10:00:00, commerc, 20
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> and I would reduce these rows to
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> It is relatively hard to do it now with SQL only.
Only relatively. My standard solution is something like this:
select start_time, reason, sum(km) as km
  from (select max(label_time) over (order by start) as start_time,
               reason, km
          from (select start, reason, km,
                       case when reason
                                 is distinct from
                                 lag(reason) over (order by start)
                            then start
                        end as label_time
                  from yourtable
               ) s2
       ) s1
 group by start_time, reason
 order by start_time;
(Your change_number idea is essentially equivalent to doing
sum(case when x is distinct from lag(x) over w then 1 end) over w,
except that since window functions can't be nested, that expression
requires a subquery.)
--
Andrew (irc:RhodiumToad)
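Andrew's pattern runs as written on PostgreSQL; as a self-contained check it can be replayed on SQLite through Python, with SQLite's null-safe IS NOT standing in for IS DISTINCT FROM and yourtable filled with the sample rows (both substitutions are mine, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table yourtable (start text, reason text, km int);
    insert into yourtable values
      ('2014-01-01 08:00:00', 'private', 10),
      ('2014-01-01 09:00:00', 'commerc', 20),
      ('2014-01-01 10:00:00', 'commerc', 20),
      ('2014-01-01 11:00:00', 'private', 8);
""")

# label_time marks the first row of each run; the running max() (which
# ignores NULLs) then copies that label onto the rest of the run, giving
# a stable group key per run of equal "reason" values.
result = conn.execute("""
    select start_time, reason, sum(km) as km
    from (select max(label_time) over (order by start) as start_time,
                 reason, km
          from (select start, reason, km,
                       case when reason
                                 is not lag(reason) over (order by start)
                            then start end as label_time
                from yourtable) s2) s1
    group by start_time, reason
    order by start_time
""").fetchall()
```

The output matches the reduction Pavel asked for.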
2014-09-21 17:51 GMT+02:00 Andrew Gierth <andrew@tao11.riddles.org.uk>:
>>>>> "Pavel" == Pavel Stehule <pavel.stehule@gmail.com> writes:
Pavel> Hi
Pavel> I tried to solve following task:
Pavel> I have a table
Pavel> start, reason, km
Pavel> =============
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20
Pavel> 2014-01-01 10:00:00, commerc, 20
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> and I would reduce these rows to
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> It is relatively hard to do it now with SQL only.
Only relatively. My standard solution is something like this:
select start_time, reason, sum(km) as km
  from (select max(label_time) over (order by start) as start_time,
               reason, km
          from (select start, reason, km,
                       case when reason
                                 is distinct from
                                 lag(reason) over (order by start)
                            then start
                        end as label_time
                  from yourtable
               ) s2
       ) s1
 group by start_time, reason
 order by start_time;
(Your change_number idea is essentially equivalent to doing
sum(case when x is distinct from lag(x) over w then 1 end) over w,
except that since window functions can't be nested, that expression
requires a subquery.)
Yes, I found this solution in my third iteration too, so this proposal loses
its main benefit.

Regards

Pavel
--
Andrew (irc:RhodiumToad)
2014-09-21 18:08 GMT+02:00 Rémi Cura <remi.cura@gmail.com>:
Hey, sorry if what I say is obvious to you. If I understood your problem
correctly, it is strictly equivalent to this one:

http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-td5775363.html

What you want is essentially to generate a unique group_id, but one that
depends on an order of rows not defined in the group. There is a postgres
trick to solve this problem: generate a row number by the order you want,
then a row number by the group; a subtraction of the two row numbers then
gives you a unique id per group. The cost is that you have to use 2 window
functions, hence 2 scans I guess.
Yes, it is a little bit similar. I found the pattern described by Andrew
works well too.

Regards

Pavel
Cheers,
Rémi-C

2014-09-21 17:51 GMT+02:00 Andrew Gierth <andrew@tao11.riddles.org.uk>:
>>>>> "Pavel" == Pavel Stehule <pavel.stehule@gmail.com> writes:
Pavel> Hi
Pavel> I tried to solve following task:
Pavel> I have a table
Pavel> start, reason, km
Pavel> =============
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20
Pavel> 2014-01-01 10:00:00, commerc, 20
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> and I would reduce these rows to
Pavel> 2014-01-01 08:00:00, private, 10
Pavel> 2014-01-01 09:00:00, commerc, 20 + 20 = 40
Pavel> 2014-01-01 11:00:00, private, 8
Pavel> It is relatively hard to do it now with SQL only.
Only relatively. My standard solution is something like this:
select start_time, reason, sum(km) as km
  from (select max(label_time) over (order by start) as start_time,
               reason, km
          from (select start, reason, km,
                       case when reason
                                 is distinct from
                                 lag(reason) over (order by start)
                            then start
                        end as label_time
                  from yourtable
               ) s2
       ) s1
 group by start_time, reason
 order by start_time;
(Your change_number idea is essentially equivalent to doing
sum(case when x is distinct from lag(x) over w then 1 end) over w,
except that since window functions can't be nested, that expression
requires a subquery.)
--
Andrew (irc:RhodiumToad)