Thread: analyzing query results
Greetings, I have a PostgreSQL-8.x database that contains several tables which hold the results of automated regression tests. Each table contains a specific class of test, however the results from different subtests of that class of test reside in a single table. For example, I have tests A, B & C stored in tables a, b & c, and then there are subtests such as 'build a' and 'test a' in table a, etc. Once each day, I run a report against the contents of these tables, which then emails a report which lists only the subtests which have failed in the previous 24 hours. For this, I'm doing a SQL query along the lines of: select subtest,os,arch,info FROM ${table} WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' AND current_status='FAILED' ; The problem here is that occasionally the same subtest runs more than once in a 24 hour period (although not often), and not all of the test runs will have the same result (current_status). To make matters more complicated when this happens, the additional runs will have a current_status of PASSED, yet my automated report still grabs the older FAILED test result, and this confuses the hell out of the people reviewing the report, since its technically passed already. My problem is that I can't think of a non-trivial way to adjust the SQL query so that it will only capture the last time the subtest ran, in the event that it ran multiple times during a 24 hour window. There are different permutations of os, arch & info (actually about 400 in total), so explicitly iterating through all of them is going to be an ugly and expensive hack that I really want to avoid. If it helps any, each row in the assorted tables has a unique id number that auto-increments at insertion, but beyond that there isn't any other non-trivial way to differentiate one row from another as the fields are an assortment of text, integer, boolean & char. Does anyone have any good ideas? thanks in advance. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ L. Friedman netllama@gmail.com LlamaLand http://netllama.linux-sxs.org
"Lonni J Friedman" <netllama@gmail.com> writes: > select subtest,os,arch,info FROM ${table} WHERE (SELECT now() - > interval '24 hours' < date_created::timestamp)='t' AND > current_status='FAILED' ; > My problem is that I can't think of a non-trivial way to adjust the > SQL query so that it will only capture the last time the subtest ran, > in the event that it ran multiple times during a 24 hour window. If you don't mind using a nonstandard feature, DISTINCT ON will probably help you solve this. This example looks pretty close to being the same as the "get the latest weather reports" example that you'll find in the PG reference page for SELECT. BTW, why are you writing the timestamp filter condition in such a bizarre way? I'd expect to see that query written as select subtest,os,arch,info FROM ${table} WHERE (now() - interval '24 hours') < date_created AND current_status='FAILED' ; which would have a fighting chance of using an index on date_created. The useless sub-SELECT not only eliminates any chance of using an index, but incurs a fair amount of extra runtime overhead. regards, tom lane
Lonni J Friedman wrote: > The problem here is that occasionally the same subtest runs more than > once in a 24 hour period (although not often), and not all of the test > runs will have the same result (current_status). To make matters more > complicated when this happens, the additional runs will have a > current_status of PASSED, yet my automated report still grabs the > older FAILED test result, and this confuses the hell out of the people > reviewing the report, since its technically passed already. DDL and some sample data would be useful to help out, but if I understood what you were describing, here's a simple mockup I tossed into my local playground: create table testresults ( id serial not null primary key, subtest text not null, os text not null, arch text not null, info text not null, current_status text not null, date_created date not null); insert into testresults (subtest, os, arch, info, current_status, date_created) values ('a','linux','x86','info','PASSED',now()), ('b','linux','x86','info','PASSED',now()), ('c','linux','x86','info','FAILED',now()), ('c','linux','x86','info','FAILED','2007-07-01'), ('c','linux','x86','info','PASSED','2007-07-01'), ('d','linux','x86','info','FAILED',now()), ('d','linux','x86','info','PASSED',now()), ('e','linux','x86','info','FAILED','2007-07-01'); select a.* from testresults a left join testresults b on (b.id <> a.id and b.date_created >= a.date_created and b.current_status = 'PASSED' and row(b.subtest, b.os, b.arch) = row(a.subtest, a.os, a.arch)) where a.current_status = 'FAILED' and b.id is null and a.date_created >= now() - interval '24 hours'; id | subtest | os | arch | info | current_status | date_created ----+---------+-------+------+------+----------------+-------------- 8 | c | linux | x86 | info | FAILED | 2007-08-10 (1 row) The row constructors within the join should be adjusted to match the set of columns within the table that would adequately match two separate rows as having been the same subtest on a different run (I just picked the three that stood out from your original query as possibly being identifying). This may well not be the best way to do it. I've obviously only tested it against a very tiny set of data. And without any DDL or sample data, I may have misunderstood the problem. -Jon -- Senior Systems Developer Media Matters for America http://mediamatters.org/
On 8/10/07, Jon Sime <jsime@mediamatters.org> wrote: > Lonni J Friedman wrote: > > The problem here is that occasionally the same subtest runs more than > > once in a 24 hour period (although not often), and not all of the test > > runs will have the same result (current_status). To make matters more > > complicated when this happens, the additional runs will have a > > current_status of PASSED, yet my automated report still grabs the > > older FAILED test result, and this confuses the hell out of the people > > reviewing the report, since its technically passed already. > > DDL and some sample data would be useful to help out, but if I > understood what you were describing, here's a simple mockup I tossed > into my local playground: > > create table testresults ( > id serial not null primary key, > subtest text not null, > os text not null, > arch text not null, > info text not null, > current_status text not null, > date_created date not null); > > insert into testresults > (subtest, os, arch, info, current_status, date_created) > values > ('a','linux','x86','info','PASSED',now()), > ('b','linux','x86','info','PASSED',now()), > ('c','linux','x86','info','FAILED',now()), > ('c','linux','x86','info','FAILED','2007-07-01'), > ('c','linux','x86','info','PASSED','2007-07-01'), > ('d','linux','x86','info','FAILED',now()), > ('d','linux','x86','info','PASSED',now()), > ('e','linux','x86','info','FAILED','2007-07-01'); > > select a.* > from testresults a > left join testresults b on (b.id <> a.id and > b.date_created >= a.date_created and > b.current_status = 'PASSED' and > row(b.subtest, b.os, b.arch) > = row(a.subtest, a.os, a.arch)) > where a.current_status = 'FAILED' and b.id is null and > a.date_created >= now() - interval '24 hours'; > > id | subtest | os | arch | info | current_status | date_created > ----+---------+-------+------+------+----------------+-------------- > 8 | c | linux | x86 | info | FAILED | 2007-08-10 > (1 row) > > The row constructors within the join should be adjusted to match the set > of columns within the table that would adequately match two separate > rows as having been the same subtest on a different run (I just picked > the three that stood out from your original query as possibly being > identifying). > > This may well not be the best way to do it. I've obviously only tested > it against a very tiny set of data. And without any DDL or sample data, > I may have misunderstood the problem. First, thanks much for your input. I'm sorry that I didn't provide sample data up front. I'll go into more detail now. In my database, I have one column whose content is always the same, for each iteration of a subtest. That column is named 'cudacode'. Thus, if a given subtest runs multiple times in any 24 hour window, each inserted row will always have the same cudacode for that subtest. So cudacode is the unique identifier. That said, here's some example data that illustrates the problem i'm having: id | date_created | cudacode | current_status ----+----------------------------+-----------------------------------------------------------------------+---------------------------------------- 13 | 2007-08-13 11:00:18.782425 | 2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | FAILED 14 | 2007-08-13 11:00:39.917609 | 2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | PASSED 15 | 2007-08-13 11:46:09.770971 | 2007-05-13.cuda-linux64-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | FAILED In the above data, id 13 & 14 are two iterations of the same subtest (they have the same cudacode), yet the first has a current_status of FAILED, and the second is PASSED. What I want is a query that will detect this and return nothing because the later row (per the date_created field) has PASSED, since all I care about is when the last iteration has FAILED. id 15 has a different cudacode, and therefore should be returned from the query as well. In other words, I'd like to see the content of row 15 returned, since its the only one that meets my criteria. Or as a starting point, just get rows 14 & 15 returned, since 14 & 15 are the newest unique instances of cudacode, and then I can check the current_status in those as a secondary query. I tried to give that a whirl below: select a.* from cudaapps a left join cudaapps b on (b.cudacode <> a.cudacode and b.date_created >= a.date_created and b.current_status = 'PASSED' and row(b.subtest, b.os, b.arch) = row(a.subtest, a.os, a.arch)) where a.current_status = 'FAILED' and b.id is null and a.date_created >= now() - interval '24 hours'; but for reasons that aren't clear to me, I'm getting id 13 & 15 returned instead of 14 & 15, and I can't figure out why. Separately, I tried to use Tom's advice to call DISTINCT ON, with the following query: SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ORDER BY date_created; but that too is returning id 13, 14 & 15. If I strip out the date_created in the DISTINCT ON and ORDER BY sections: SELECT DISTINCT ON(cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ; then I get id 13 & 15 all over again, rather than 14 & 15. I'm guessing that I'm missing something really obvious here, but I can't quite figure out what. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ L. Friedman netllama@gmail.com LlamaLand http://netllama.linux-sxs.org
On 8/10/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Lonni J Friedman" <netllama@gmail.com> writes: > > select subtest,os,arch,info FROM ${table} WHERE (SELECT now() - > > interval '24 hours' < date_created::timestamp)='t' AND > > current_status='FAILED' ; > > > My problem is that I can't think of a non-trivial way to adjust the > > SQL query so that it will only capture the last time the subtest ran, > > in the event that it ran multiple times during a 24 hour window. > > If you don't mind using a nonstandard feature, DISTINCT ON will probably > help you solve this. This example looks pretty close to being the same > as the "get the latest weather reports" example that you'll find in the > PG reference page for SELECT. Here's some example data that illustrates the problem i'm having: id | date_created | cudacode | current_status ----+----------------------------+-----------------------------------------------------------------------+---------------------------------------- 13 | 2007-08-13 11:00:18.782425 | 2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | FAILED 14 | 2007-08-13 11:00:39.917609 | 2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | PASSED 15 | 2007-08-13 11:46:09.770971 | 2007-05-13.cuda-linux64-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler | FAILED In the above data, id 13 & 14 are two iterations of the same subtest (they have the same cudacode), yet the first has a current_status of FAILED, and the second is PASSED. What I want is a query that will detect this and return nothing because the later row (per the date_created field) has PASSED, since all I care about is when the last iteration has FAILED. id 15 has a different cudacode, and therefore should be returned from the query as well. In other words, I'd like to see the content of row 15 returned, since its the only one that meets my criteria. Or as a starting point, just get rows 14 & 15 returned, since 14 & 15 are the newest unique instances of cudacode, and then I can check the current_status in those as a secondary query. I tried to call DISTINCT ON, with the following query: SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ORDER BY date_created; but that is returning id 13, 14 & 15. If I strip out the date_created in the DISTINCT ON and ORDER BY sections: SELECT DISTINCT ON(cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ; then I get id 13 & 15 all over again, rather than 14 & 15. I'm guessing that I'm missing something really obvious here, but I can't quite figure out what. > > BTW, why are you writing the timestamp filter condition in such a > bizarre way? I'd expect to see that query written as > > select subtest,os,arch,info FROM ${table} > WHERE > (now() - interval '24 hours') < date_created > AND current_status='FAILED' ; > > which would have a fighting chance of using an index on date_created. > The useless sub-SELECT not only eliminates any chance of using an index, > but incurs a fair amount of extra runtime overhead. I'm doing this because date_created is of type char(20) rather than a normal/expected date datatype. I know, its dumb.
"Lonni J Friedman" <netllama@gmail.com> writes: > I tried to call DISTINCT ON, with the following query: > SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM > cudaapps WHERE (SELECT now() - interval '24 hours' < > date_created::timestamp)='t' ORDER BY date_created; No, you're not grokking how to use DISTINCT ON. The DISTINCT part specifies what rows you want to group together --- here, all the ones with the same cudacode --- and then ORDER BY has to list all those same columns *followed by* the ordering condition that determines which row you want out of the group. What I think you want is SELECT DISTINCT ON (cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ORDER BY cudacode, date_created DESC; which gives you the latest current_status for each cudacode, and then you filter the uninteresting rows in an outer select: SELECT id FROM (SELECT DISTINCT ON (cudacode) id,current_status FROM cudaapps WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' ORDER BY cudacode, date_created DESC) ss WHERE current_status = 'FAILED'; >> BTW, why are you writing the timestamp filter condition in such a >> bizarre way? > I'm doing this because date_created is of type char(20) rather than a > normal/expected date datatype. I know, its dumb. Well, so you have to have the cast, but I'd still think that WHERE (now() - interval '24 hours') < date_created::timestamptz would be the best way to express it. The sub-SELECT and the comparison to 't' are just obscurantism. Another issue here is whether the date format was chosen to ensure that textual sorting of the values would give the same result as datewise sorting. You might get some fairly bizarre misbehavior if the data is sloppy about spaces instead of zero-fill, for instance. Sometimes it's worth biting the bullet and fixing the column type ... regards, tom lane
On 8/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Lonni J Friedman" <netllama@gmail.com> writes: > > I tried to call DISTINCT ON, with the following query: > > > SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM > > cudaapps WHERE (SELECT now() - interval '24 hours' < > > date_created::timestamp)='t' ORDER BY date_created; > > No, you're not grokking how to use DISTINCT ON. The DISTINCT part > specifies what rows you want to group together --- here, all the ones > with the same cudacode --- and then ORDER BY has to list all those same > columns *followed by* the ordering condition that determines which row > you want out of the group. What I think you want is > > SELECT DISTINCT ON (cudacode) id,current_status > FROM cudaapps > WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' > ORDER BY cudacode, date_created DESC; > > which gives you the latest current_status for each cudacode, and then > you filter the uninteresting rows in an outer select: > > SELECT id > FROM > (SELECT DISTINCT ON (cudacode) id,current_status > FROM cudaapps > WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t' > ORDER BY cudacode, date_created DESC) ss > WHERE current_status = 'FAILED'; Thanks, this is exactly what I need. Only one question. What is the 'ss' that you have in that query? I tried googling, but getting any useful hits on just "postgres ss" returns a ton of useless results. > > >> BTW, why are you writing the timestamp filter condition in such a > >> bizarre way? > > > I'm doing this because date_created is of type char(20) rather than a > > normal/expected date datatype. I know, its dumb. > > Well, so you have to have the cast, but I'd still think that > WHERE (now() - interval '24 hours') < date_created::timestamptz > would be the best way to express it. The sub-SELECT and the comparison > to 't' are just obscurantism. Thanks, that change is definitely equivalent. > > Another issue here is whether the date format was chosen to ensure that > textual sorting of the values would give the same result as datewise > sorting. You might get some fairly bizarre misbehavior if the data is > sloppy about spaces instead of zero-fill, for instance. Sometimes it's > worth biting the bullet and fixing the column type ... Thankfully the date is always inserted via an automated script, so its not going to get mangled or changed over time. The issue that I was trying to address was a 3rd party web app which does a horrible job of formatting date fields. I'm basically working around this apps bad behavior by feeding it the date in the format that I need, rather than letting the app mangle it. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ L. Friedman netllama@gmail.com LlamaLand http://netllama.linux-sxs.org
"Lonni J Friedman" <netllama@gmail.com> writes: > Thanks, this is exactly what I need. Only one question. What is the > 'ss' that you have in that query? Table alias for the sub-select. The SQL spec says you have to provide one ... you can leave out the AS keyword, though. regards, tom lane