Thread: Time differences between rows, not columns?
Using postgresSQL 8.0 for windows, running on windows 2003 server 16gb RAM, 3Ghz dual p4. Language: Java - JDBC postgres driver: postgresql-8.0-310.jdbc3.jar I have an application with 3 tables (in this context that is) Table bills bill_id NOT NULL serial ... And other columns Table bill_status bill_id (references bills.bill_id) statusid int4 (references bill_statuslookup.statusid) statustime datetime Table bill_statuslookup statusid serial not null statusname varchar(255) The application basically tracks a workflow of bills flowing fromone department to another. Everytime the bill moves one step, an entry is made into the bill_status table. Eg. Bills table ------------- Bill_id otherfield1 1 ........ Bill_status table: --------------------- Bill_id statusid statustime 1 10 2005-04-04 00:34:31 1 20 2005-04-05 00:55:00 Bill_statuslookup table: ------------------------- Statusid Statusname 10 submitted 20 received 30 rejected 40 accepted .. .. Now my problem is this: 1. Find the time taken for each bill to reach from status 10 to status 20 , given the time of status 10 should be between t1 and t2. Eg I want to know how much time it took for a bill to be accepted after it was submitted (criteria: submitted between yesterday and today) 2. I want to know how many bills took <7 days, how many tok 7-15 days, how many took >15 days etc. The status is a lookup table because the workflow constantly changes, and I can't have submitted_on, recd_on accepted_on times etc in the main bills table as columns because there are way too many statuses in the life of a bill (read invoice). Hope its clear as to what I'm looking for. Any help is greatly appreciated!! Regards, Amit
Amit, You say "I want to know how much time it took for a bill to be accepted after it was submitted" So, do you want between 10 and 40, not 10 and 20? I assume you meant 10 and 40. Ok, there's a few approaches to your questions, first To get "how much time it took for a bill to be accepted after it was submitted": select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) from bill_status where status_id in ( 10,40 )group by bill_id; that will give you the amt of time (simple date arithmetic, so, in days) from submission to being accepted. You also, want a simple distribution, you can use the technique above But with SUM: select sum(case when diff < 7 then 1 else 0 end) as lt_7, sum(case when diff between 7 and 15 then 1 else 0 end) as btw_7_15, sum(case when diff > 15 then 1 else 0 end) as gt_15 from ( select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) as diff from bill_status where status_id in ( 10,40 )group by bill_id )x; I have a recipe in my upcoming book ('The SQL Cookbook' by O'Reilly) So, I'd like to know if what I suggested works out for you. Obviously, you'll need whatever tweaks to make it perfect for your system, the feedback I am concerned about is the technique. hope that helps, Anthony -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Amit_Wadhwa@Dell.com Sent: Tuesday, August 30, 2005 5:45 AM To: pgsql-sql@postgresql.org Subject: [SQL] Time differences between rows, not columns? Importance: High Using postgresSQL 8.0 for windows, running on windows 2003 server 16gb RAM, 3Ghz dual p4. Language: Java - JDBC postgres driver: postgresql-8.0-310.jdbc3.jar I have an application with 3 tables (in this context that is) Table bills bill_id NOT NULL serial ... And other columns Table bill_status bill_id (references bills.bill_id) statusid int4 (references bill_statuslookup.statusid) statustime datetime Table bill_statuslookup statusid serial not null statusname varchar(255) The application basically tracks a workflow of bills flowing fromone department to another. Everytime the bill moves one step, an entry is made into the bill_status table. Eg. Bills table ------------- Bill_id otherfield1 1 ........ Bill_status table: --------------------- Bill_id statusid statustime 1 10 2005-04-04 00:34:31 1 20 2005-04-05 00:55:00 Bill_statuslookup table: ------------------------- Statusid Statusname 10 submitted 20 received 30 rejected 40 accepted .. .. Now my problem is this: 1. Find the time taken for each bill to reach from status 10 to status 20 , given the time of status 10 should be between t1 and t2. Eg I want to know how much time it took for a bill to be accepted after it was submitted (criteria: submitted between yesterday and today) 2. I want to know how many bills took <7 days, how many tok 7-15 days, how many took >15 days etc. The status is a lookup table because the workflow constantly changes, and I can't have submitted_on, recd_on accepted_on times etc in the main bills table as columns because there are way too many statuses in the life of a bill (read invoice). Hope its clear as to what I'm looking for. Any help is greatly appreciated!! Regards, Amit ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypesdo not match
Thanks! <snip>Amit,You say "I want to know how much time it took for a bill to be accepted after it was submitted" So, do you want between 10 and 40, not 10 and 20? I assume you meant 10 and 40... </snip> ...Could be any status to any status, I wanted to generalize the concept for future usage. I tried the below: 1. select bill_id,(max(case when statusid = 40 then statustime end) - max(case when statusid = 10 then statustime end)) from ap.vits_statusupdate where statusid in ( 10,40 ) group by bill_id; Took 16 ms... But meanwhile I also tried: select a.bill_id, b.statustime-a.statustime from ap.vits_statusupdate a, ap.vits_statusupdate b where a.statusid = 10 and b.statusid = 40 and a.bill_id = b.bill_id Took 15 ms The second query is taking approximately 1 ms lesser time.. (have tried about 30 times in pgadmin) Got about 10 records in my test resultset. Surprisingly, if I add the order by bill_id cluase at the end of both queries, then your query performs 1 ms faster than mine, Don't know why, Is there any performance issues if this is run over 10000+ rows? Which (yours or mine?) do you think will be faster, and more efficient? 2. select sum(case when diff < 7 then 1 else 0 end) as lt_7,sum(case when diff between 7 and 15 then 1 else 0 end) as btw_7_15,sum(case when diff > 15 then 1 else 0 end) as gt_15 from (select bill_id,(max(case when statusid = 40 then statustime end) - max(case when statusid = 10 then statustime end)) as diff from ap.vits_statusupdate where statusid in ( 10,40 ) group by bill_id) x; Tried this one too with both the queries in the sub-select (yours and mine), and works perfectly well for me, with accurate answers, and exactly what I'm looking for! shows 0 ms (??).. Next question.. Which one should I use? The joined query? Or the max() query? Thanks a lot for your time! Regards, Amit -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Anthony Molinaro Sent: Tuesday, August 30, 2005 4:59 PM To: Wadhwa, Amit; pgsql-sql@postgresql.org Subject: Re: [SQL] Time differences between rows, not columns? Amit, You say "I want to know how much time it took for a bill to be accepted after it was submitted" So, do you want between 10 and 40, not 10 and 20? I assume you meant 10 and 40. Ok, there's a few approaches to your questions, first To get "how much time it took for a bill to be accepted after it was submitted": select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) from bill_status where status_id in ( 10,40 )group by bill_id; that will give you the amt of time (simple date arithmetic, so, in days) from submission to being accepted. You also, want a simple distribution, you can use the technique above But with SUM: select sum(case when diff < 7 then 1 else 0 end) as lt_7, sum(case when diff between 7 and 15 then 1 else 0 end) as btw_7_15, sum(case when diff > 15 then 1 else 0 end) as gt_15 from ( select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) as diff from bill_status where status_id in ( 10,40 )group by bill_id )x; I have a recipe in my upcoming book ('The SQL Cookbook' by O'Reilly) So, I'd like to know if what I suggested works out for you. Obviously, you'll need whatever tweaks to make it perfect for your system, the feedback I am concerned about is the technique. hope that helps, Anthony -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Amit_Wadhwa@Dell.com Sent: Tuesday, August 30, 2005 5:45 AM To: pgsql-sql@postgresql.org Subject: [SQL] Time differences between rows, not columns? Importance: High Using postgresSQL 8.0 for windows, running on windows 2003 server 16gb RAM, 3Ghz dual p4. Language: Java - JDBC postgres driver: postgresql-8.0-310.jdbc3.jar I have an application with 3 tables (in this context that is) Table bills bill_id NOT NULL serial ... And other columns Table bill_status bill_id (references bills.bill_id) statusid int4 (references bill_statuslookup.statusid) statustime datetime Table bill_statuslookup statusid serial not null statusname varchar(255) The application basically tracks a workflow of bills flowing fromone department to another. Everytime the bill moves one step, an entry is made into the bill_status table. Eg. Bills table ------------- Bill_id otherfield1 1 ........ Bill_status table: --------------------- Bill_id statusid statustime 1 10 2005-04-04 00:34:31 1 20 2005-04-05 00:55:00 Bill_statuslookup table: ------------------------- Statusid Statusname 10 submitted 20 received 30 rejected 40 accepted .. .. Now my problem is this: 1. Find the time taken for each bill to reach from status 10 to status 20 , given the time of status 10 should be between t1 and t2. Eg I want to know how much time it took for a bill to be accepted after it was submitted (criteria: submitted between yesterday and today) 2. I want to know how many bills took <7 days, how many tok 7-15 days, how many took >15 days etc. The status is a lookup table because the workflow constantly changes, and I can't have submitted_on, recd_on accepted_on times etc in the main bills table as columns because there are way too many statuses in the life of a bill (read invoice). Hope its clear as to what I'm looking for. Any help is greatly appreciated!! Regards, Amit ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypesdo not match ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
Amit, Glad it worked out :) As for the performance, lemme say that while I'm a huge fan Of postgres, my experience in regards to optimization in a production Environment is limited to Oracle and DB2. In oracle for example, if you have an index on a numeric field and perform min/max on it, there's a very cool algorithm that allows very fast retrieval of that data (basically the index scan is optimized cuz oracle knows you want only the extreme values). So, on oracle I'd use the MAX and SUM versions I suggested because There's a specific optimization to facilitate it and the query, since It accesses the table only once, is doing less logical reads. So, at the very least, in the version I suggested, since you are not Joining, you are performing less logical reads, which is always good. But, ultimately, you have to test and see what works for you. The self join might be just fine. 10k rows should not be a problem for either method, assuming you have an index on statusid and bill_id. Give it a spin and update this thread. Regards, Anthony -----Original Message----- From: Amit_Wadhwa@Dell.com [mailto:Amit_Wadhwa@Dell.com] Sent: Tuesday, August 30, 2005 8:08 AM To: Anthony Molinaro; pgsql-sql@postgresql.org Subject: RE: [SQL] Time differences between rows, not columns? Importance: High Thanks! <snip>Amit,You say "I want to know how much time it took for a bill to be accepted after it was submitted" So, do you want between 10 and 40, not 10 and 20? I assume you meant 10 and 40... </snip> ...Could be any status to any status, I wanted to generalize the concept for future usage. I tried the below: 1. select bill_id,(max(case when statusid = 40 then statustime end) - max(case when statusid = 10 then statustime end)) from ap.vits_statusupdate where statusid in ( 10,40 ) group by bill_id; Took 16 ms... But meanwhile I also tried: select a.bill_id, b.statustime-a.statustime from ap.vits_statusupdate a, ap.vits_statusupdate b where a.statusid = 10 and b.statusid = 40 and a.bill_id = b.bill_id Took 15 ms The second query is taking approximately 1 ms lesser time.. (have tried about 30 times in pgadmin) Got about 10 records in my test resultset. Surprisingly, if I add the order by bill_id cluase at the end of both queries, then your query performs 1 ms faster than mine, Don't know why, Is there any performance issues if this is run over 10000+ rows? Which (yours or mine?) do you think will be faster, and more efficient? 2. select sum(case when diff < 7 then 1 else 0 end) as lt_7,sum(case when diff between 7 and 15 then 1 else 0 end) as btw_7_15,sum(case when diff > 15 then 1 else 0 end) as gt_15 from (select bill_id,(max(case when statusid = 40 then statustime end) - max(case when statusid = 10 then statustime end)) as diff from ap.vits_statusupdate where statusid in ( 10,40 ) group by bill_id) x; Tried this one too with both the queries in the sub-select (yours and mine), and works perfectly well for me, with accurate answers, and exactly what I'm looking for! shows 0 ms (??).. Next question.. Which one should I use? The joined query? Or the max() query? Thanks a lot for your time! Regards, Amit -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Anthony Molinaro Sent: Tuesday, August 30, 2005 4:59 PM To: Wadhwa, Amit; pgsql-sql@postgresql.org Subject: Re: [SQL] Time differences between rows, not columns? Amit, You say "I want to know how much time it took for a bill to be accepted after it was submitted" So, do you want between 10 and 40, not 10 and 20? I assume you meant 10 and 40. Ok, there's a few approaches to your questions, first To get "how much time it took for a bill to be accepted after it was submitted": select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) from bill_status where status_id in ( 10,40 )group by bill_id; that will give you the amt of time (simple date arithmetic, so, in days) from submission to being accepted. You also, want a simple distribution, you can use the technique above But with SUM: select sum(case when diff < 7 then 1 else 0 end) as lt_7, sum(case when diff between 7 and 15 then 1 else 0 end) as btw_7_15, sum(case when diff > 15 then 1 else 0 end) as gt_15 from ( select bill_id, (max(case when status_id = 40 then statustime end) - max(case when status_id =10 then statustime end)) as diff from bill_status where status_id in ( 10,40 )group by bill_id )x; I have a recipe in my upcoming book ('The SQL Cookbook' by O'Reilly) So, I'd like to know if what I suggested works out for you. Obviously, you'll need whatever tweaks to make it perfect for your system, the feedback I am concerned about is the technique. hope that helps, Anthony -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Amit_Wadhwa@Dell.com Sent: Tuesday, August 30, 2005 5:45 AM To: pgsql-sql@postgresql.org Subject: [SQL] Time differences between rows, not columns? Importance: High Using postgresSQL 8.0 for windows, running on windows 2003 server 16gb RAM, 3Ghz dual p4. Language: Java - JDBC postgres driver: postgresql-8.0-310.jdbc3.jar I have an application with 3 tables (in this context that is) Table bills bill_id NOT NULL serial ... And other columns Table bill_status bill_id (references bills.bill_id) statusid int4 (references bill_statuslookup.statusid) statustime datetime Table bill_statuslookup statusid serial not null statusname varchar(255) The application basically tracks a workflow of bills flowing fromone department to another. Everytime the bill moves one step, an entry is made into the bill_status table. Eg. Bills table ------------- Bill_id otherfield1 1 ........ Bill_status table: --------------------- Bill_id statusid statustime 1 10 2005-04-04 00:34:31 1 20 2005-04-05 00:55:00 Bill_statuslookup table: ------------------------- Statusid Statusname 10 submitted 20 received 30 rejected 40 accepted .. .. Now my problem is this: 1. Find the time taken for each bill to reach from status 10 to status 20 , given the time of status 10 should be between t1 and t2. Eg I want to know how much time it took for a bill to be accepted after it was submitted (criteria: submitted between yesterday and today) 2. I want to know how many bills took <7 days, how many tok 7-15 days, how many took >15 days etc. The status is a lookup table because the workflow constantly changes, and I can't have submitted_on, recd_on accepted_on times etc in the main bills table as columns because there are way too many statuses in the life of a bill (read invoice). Hope its clear as to what I'm looking for. Any help is greatly appreciated!! Regards, Amit ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypesdo not match ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings