Thread: window function ordering not working as expected
Greetings, I have a postgresql-9.3.x database with a table with a variety of date stamped test results, some of which are stored in json format (natively in the database). I'm attempting to use some window functions to pull out specific data from the test results over a a time window, but part of the results are not making sense. Some tests run every day, others less frequently. For each unique test's results, I'm interested in seeing: * the date for the most recent result * test name (identifier) * most recent result (decimal value) * the worst (lowest decimal value) test result from the past 21 days * the date which corresponds with the worst test result from the past 21 days * the 2nd worst (2nd lowest decimal value) test result Here's a sample of the data and resulting score for one test (tname) from the past few weeks: tstamp | tname | score ------------+-----------------+-------- 2015-02-17 | dfw001.ix-cr-02 | 0.7 2015-02-15 | dfw001.ix-cr-02 | 0.6 2015-02-14 | dfw001.ix-cr-02 | 0.6 2015-02-14 | dfw001.ix-cr-02 | 0.7 2015-02-13 | dfw001.ix-cr-02 | 0.6 2015-02-12 | dfw001.ix-cr-02 | 0.7 2015-02-11 | dfw001.ix-cr-02 | 0.7 2015-02-10 | dfw001.ix-cr-02 | 0.7 2015-02-09 | dfw001.ix-cr-02 | 0.7 2015-02-08 | dfw001.ix-cr-02 | 0.7 2015-02-08 | dfw001.ix-cr-02 | 0.5 2015-02-07 | dfw001.ix-cr-02 | 0.7 2015-02-07 | dfw001.ix-cr-02 | 0.5 2015-02-06 | dfw001.ix-cr-02 | 0.7 2015-02-05 | dfw001.ix-cr-02 | 0.7 2015-02-04 | dfw001.ix-cr-02 | 0.7 2015-01-30 | dfw001.ix-cr-02 | 0.7 Here's the SQL query that I'm running: SELECT * FROM (SELECT tstamp, concat_ws('/',attrs->>'RCluster ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d'' target=''_blank''>)','')) AS tname , metrics->>'PT TWBR' AS score, first_value(metrics->>'PT TWBR') OVER (PARTITION BY concat_ws('/',attrs->>'Route Cluster ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d'' target=''_blank''>)','')) ORDER BY metrics->>'PT TWBR') AS worst_score, first_value(tstamp) OVER (PARTITION BY concat_ws('/',attrs->>'Route Cluster ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d'' target=''_blank''>)','')) ORDER BY metrics->>'PT TWBR') AS worst_date, lead(metrics->>'PT TWBR', 1) OVER (PARTITION BY concat_ws('/',attrs->>'Route Cluster ID', regexp_replace(replace(replace(attrs->>'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(&d=5d'' target=''_blank''>)','')) ORDER BY metrics->>'PT TWBR') AS prv_score FROM btworks WHERE age(now(),tstamp) < '21 days' ORDER BY tstamp DESC, rank ) AS stuff WHERE tstamp = '2015-02-17'; Here's the data from the above query as it pertains to the data (tname='dfw001.ix-cr-02') set that I posted above: tstamp | tname | score | worst_score | worst_date | prv_score ------------+-----------------------+-------+-------------+------------+----------- 2015-02-17 | dfw001.ix-cr-02 | 0.7 | 0.5 | 2015-02-08 | 0.7 The problem that I'm seeing is in the prv_score column. It should show a value of 0.6, which corresponds with 2015-02-13, however instead its returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always be sorting by the scores, and as a result, the lead(metrics->>'PT TWBR', 1) would give me the next greatest value of the score. Thus my confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as expected. thanks in advance for any pointers.
Lonni J Friedman <netllama@gmail.com> writes: > I'm interested in seeing: > * the date for the most recent result > * test name (identifier) > * most recent result (decimal value) > * the worst (lowest decimal value) test result from the past 21 days > * the date which corresponds with the worst test result from the past 21 days > * the 2nd worst (2nd lowest decimal value) test result > ... > The problem that I'm seeing is in the prv_score column. It should show > a value of 0.6, which corresponds with 2015-02-13, however instead its > returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always > be sorting by the scores, and as a result, the lead(metrics->>'PT > TWBR', 1) would give me the next greatest value of the score. Thus my > confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as > expected. lead() and lag() retrieve values from rows that are N away from the current row in the specified ordering. That isn't what you want here AFAICS. I think the worst test result would be obtained with nth_value(metrics->>'PT TWBR', 1) which is equivalent to what you used, first_value(metrics->>'PT TWBR') while the 2nd worst result would be obtained with nth_value(metrics->>'PT TWBR', 2) However, "worst" and "2nd worst" with this implementation would mean "worst and 2nd worst within the partition", which isn't the stated goal either, at least not with the partition definition you're using. What you really want for the "worst in last 21 days" is something like min(metrics->>'PT TWBR') OVER ( PARTITION BY ... that same mess you used ... ORDER BY tstamp RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW) However Postgres doesn't implement RANGE x PRECEDING yet. You could get "worst in last 21 observations" easily: min(metrics->>'PT TWBR') OVER ( PARTITION BY ... that mess ... ORDER BY tstamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) and maybe that's close enough. I do not know an easy way to get "second worst" :-(. You could build a user-defined aggregate to produce "second smallest value among the inputs" and then apply it in the same way as I used min() here. regards, tom lane
On Tue, Feb 17, 2015 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Lonni J Friedman <netllama@gmail.com> writes: >> I'm interested in seeing: >> * the date for the most recent result >> * test name (identifier) >> * most recent result (decimal value) >> * the worst (lowest decimal value) test result from the past 21 days >> * the date which corresponds with the worst test result from the past 21 days >> * the 2nd worst (2nd lowest decimal value) test result >> ... >> The problem that I'm seeing is in the prv_score column. It should show >> a value of 0.6, which corresponds with 2015-02-13, however instead its >> returning 0.7. I thought by ordering by metrics->>'PT TWBR' I'd always >> be sorting by the scores, and as a result, the lead(metrics->>'PT >> TWBR', 1) would give me the next greatest value of the score. Thus my >> confusion as to why ORDER BY metrics->>'PT TWBR' isn't working as >> expected. > > lead() and lag() retrieve values from rows that are N away from the > current row in the specified ordering. That isn't what you want here > AFAICS. > > I think the worst test result would be obtained with > nth_value(metrics->>'PT TWBR', 1) > which is equivalent to what you used, > first_value(metrics->>'PT TWBR') > while the 2nd worst result would be obtained with > nth_value(metrics->>'PT TWBR', 2) > > However, "worst" and "2nd worst" with this implementation would mean > "worst and 2nd worst within the partition", which isn't the stated > goal either, at least not with the partition definition you're using. > > What you really want for the "worst in last 21 days" is something like > > min(metrics->>'PT TWBR') OVER ( > PARTITION BY ... that same mess you used ... > ORDER BY tstamp > RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW) > > However Postgres doesn't implement RANGE x PRECEDING yet. You could > get "worst in last 21 observations" easily: > > min(metrics->>'PT TWBR') OVER ( > PARTITION BY ... that mess ... > ORDER BY tstamp > ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) > > and maybe that's close enough. > > I do not know an easy way to get "second worst" :-(. You could build a > user-defined aggregate to produce "second smallest value among the inputs" > and then apply it in the same way as I used min() here. Thanks Tom, much appreciate the fast reply. I'll chew this over and see if I have any other questions.