Thread: window function ordering not working as expected

window function ordering not working as expected

From
Lonni J Friedman
Date:
Greetings,
I have a postgresql-9.3.x database with a table with a variety of date
stamped test results, some of which are stored in json format
(natively in the database). I'm attempting to use some window
functions to pull out specific data from the test results over a a
time window, but part of the results are not making sense. Some tests
run every day, others less frequently. For each unique test's results,
I'm interested in seeing:

* the date for the most recent result
* test name (identifier)
* most recent result (decimal value)
* the worst (lowest decimal value) test result from the past 21 days
* the date which corresponds with the worst test result from the past 21 days
* the 2nd worst (2nd lowest decimal value) test result

Here's a sample of the data and resulting score for one test (tname)
from the past few weeks:

  tstamp   |      tname      | score
------------+-----------------+--------
2015-02-17 | dfw001.ix-cr-02 | 0.7
2015-02-15 | dfw001.ix-cr-02 | 0.6
2015-02-14 | dfw001.ix-cr-02 | 0.6
2015-02-14 | dfw001.ix-cr-02 | 0.7
2015-02-13 | dfw001.ix-cr-02 | 0.6
2015-02-12 | dfw001.ix-cr-02 | 0.7
2015-02-11 | dfw001.ix-cr-02 | 0.7
2015-02-10 | dfw001.ix-cr-02 | 0.7
2015-02-09 | dfw001.ix-cr-02 | 0.7
2015-02-08 | dfw001.ix-cr-02 | 0.7
2015-02-08 | dfw001.ix-cr-02 | 0.5
2015-02-07 | dfw001.ix-cr-02 | 0.7
2015-02-07 | dfw001.ix-cr-02 | 0.5
2015-02-06 | dfw001.ix-cr-02 | 0.7
2015-02-05 | dfw001.ix-cr-02 | 0.7
2015-02-04 | dfw001.ix-cr-02 | 0.7
2015-01-30 | dfw001.ix-cr-02 | 0.7

Here's the SQL query that I'm running:

SELECT * FROM
    (SELECT tstamp,
            concat_ws('/',attrs->>'RCluster ID',
regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a href=
''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)','')) AS tname ,
            metrics->>'PT TWBR' AS score,
            first_value(metrics->>'PT TWBR') OVER
                (PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a
href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
                ORDER BY metrics->>'PT TWBR') AS worst_score,
            first_value(tstamp) OVER
                (PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a
href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
                ORDER BY metrics->>'PT TWBR') AS worst_date,
            lead(metrics->>'PT TWBR', 1) OVER
                (PARTITION BY concat_ws('/',attrs->>'Route Cluster
ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a
href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d''
target=''_blank''>)',''))
                ORDER BY metrics->>'PT TWBR') AS prv_score
    FROM btworks
    WHERE
            age(now(),tstamp) < '21 days'
    ORDER BY tstamp DESC, rank
        ) AS stuff
WHERE
    tstamp = '2015-02-17';

Here's the data from the above query as it pertains to the data
(tname='dfw001.ix-cr-02') set that I posted above:

   tstamp   |         tname         | score | worst_score | worst_date
| prv_score
------------+-----------------------+-------+-------------+------------+-----------
 2015-02-17 | dfw001.ix-cr-02       | 0.7   | 0.5         | 2015-02-08 | 0.7

The problem that I'm seeing is in the prv_score column. It should show
a value of 0.6, which corresponds with 2015-02-13, however instead its
returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always
be sorting by the scores, and as a result, the lead(metrics->>'PT
TWBR', 1) would give me the next greatest value of the score. Thus my
confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as
expected.


thanks in advance for any pointers.


Re: window function ordering not working as expected

From
Tom Lane
Date:
Lonni J Friedman <netllama@gmail.com> writes:
> I'm interested in seeing:
> * the date for the most recent result
> * test name (identifier)
> * most recent result (decimal value)
> * the worst (lowest decimal value) test result from the past 21 days
> * the date which corresponds with the worst test result from the past 21 days
> * the 2nd worst (2nd lowest decimal value) test result
> ...
> The problem that I'm seeing is in the prv_score column. It should show
> a value of 0.6, which corresponds with 2015-02-13, however instead its
> returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always
> be sorting by the scores, and as a result, the lead(metrics->>'PT
> TWBR', 1) would give me the next greatest value of the score. Thus my
> confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as
> expected.

lead() and lag() retrieve values from rows that are N away from the
current row in the specified ordering.  That isn't what you want here
AFAICS.

I think the worst test result would be obtained with
    nth_value(metrics->>'PT TWBR', 1)
which is equivalent to what you used,
    first_value(metrics->>'PT TWBR')
while the 2nd worst result would be obtained with
    nth_value(metrics->>'PT TWBR', 2)

However, "worst" and "2nd worst" with this implementation would mean
"worst and 2nd worst within the partition", which isn't the stated
goal either, at least not with the partition definition you're using.

What you really want for the "worst in last 21 days" is something like

    min(metrics->>'PT TWBR') OVER (
       PARTITION BY ... that same mess you used ...
       ORDER BY tstamp
       RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW)

However Postgres doesn't implement RANGE x PRECEDING yet.  You could
get "worst in last 21 observations" easily:

    min(metrics->>'PT TWBR') OVER (
       PARTITION BY ... that mess ...
       ORDER BY tstamp
       ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)

and maybe that's close enough.

I do not know an easy way to get "second worst" :-(.  You could build a
user-defined aggregate to produce "second smallest value among the inputs"
and then apply it in the same way as I used min() here.

            regards, tom lane


Re: window function ordering not working as expected

From
Lonni J Friedman
Date:
On Tue, Feb 17, 2015 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Lonni J Friedman <netllama@gmail.com> writes:
>> I'm interested in seeing:
>> * the date for the most recent result
>> * test name (identifier)
>> * most recent result (decimal value)
>> * the worst (lowest decimal value) test result from the past 21 days
>> * the date which corresponds with the worst test result from the past 21 days
>> * the 2nd worst (2nd lowest decimal value) test result
>> ...
>> The problem that I'm seeing is in the prv_score column. It should show
>> a value of 0.6, which corresponds with 2015-02-13, however instead its
>> returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always
>> be sorting by the scores, and as a result, the lead(metrics->>'PT
>> TWBR', 1) would give me the next greatest value of the score. Thus my
>> confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as
>> expected.
>
> lead() and lag() retrieve values from rows that are N away from the
> current row in the specified ordering.  That isn't what you want here
> AFAICS.
>
> I think the worst test result would be obtained with
>         nth_value(metrics->>'PT TWBR', 1)
> which is equivalent to what you used,
>         first_value(metrics->>'PT TWBR')
> while the 2nd worst result would be obtained with
>         nth_value(metrics->>'PT TWBR', 2)
>
> However, "worst" and "2nd worst" with this implementation would mean
> "worst and 2nd worst within the partition", which isn't the stated
> goal either, at least not with the partition definition you're using.
>
> What you really want for the "worst in last 21 days" is something like
>
>         min(metrics->>'PT TWBR') OVER (
>            PARTITION BY ... that same mess you used ...
>            ORDER BY tstamp
>            RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW)
>
> However Postgres doesn't implement RANGE x PRECEDING yet.  You could
> get "worst in last 21 observations" easily:
>
>         min(metrics->>'PT TWBR') OVER (
>            PARTITION BY ... that mess ...
>            ORDER BY tstamp
>            ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
>
> and maybe that's close enough.
>
> I do not know an easy way to get "second worst" :-(.  You could build a
> user-defined aggregate to produce "second smallest value among the inputs"
> and then apply it in the same way as I used min() here.

Thanks Tom, much appreciate the fast reply.  I'll chew this over and
see if I have any other questions.