Thread: analyzing query results

analyzing query results

From

"Lonni J Friedman"

Date:

10 August 2007, 18:47:16

Greetings,
I have a PostgreSQL-8.x database that contains several tables which
hold the results of automated regression tests.  Each table contains a
specific class of test, however the results from different subtests of
that class of test reside in a single table.  For example, I have
tests A, B & C stored in tables a, b & c, and then there are subtests
such as 'build a' and  'test a' in table a, etc.

Once each day, I run a report against the contents of these tables,
which then emails a report which lists only the subtests which have
failed in the previous 24 hours.  For this, I'm doing a SQL query
along the lines of:

select subtest,os,arch,info FROM ${table} WHERE (SELECT now() -
interval '24 hours' < date_created::timestamp)='t' AND
current_status='FAILED' ;

The problem here is that occasionally the same subtest runs more than
once in a 24 hour period (although not often), and not all of the test
runs will have the same result (current_status).  To make matters more
complicated when this happens, the additional runs will have a
current_status of PASSED, yet my automated report still grabs the
older FAILED test result, and this confuses the hell out of the people
reviewing the report, since its technically passed already.

My problem is that I can't think of a non-trivial way to adjust the
SQL query so that it will only capture the last time the subtest ran,
in the event that it ran multiple times during a 24 hour window.
There are different permutations of os, arch & info (actually about
400 in total), so explicitly iterating through all of them is going to
be an ugly and expensive hack that I really want to avoid.

If it helps any, each row in the assorted tables has a unique id
number that auto-increments at insertion, but beyond that there isn't
any other non-trivial way to differentiate one row from another as the
fields are an assortment of text, integer, boolean & char.

Does anyone have any good ideas?  thanks in advance.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       http://netllama.linux-sxs.org

Re: analyzing query results

From

Tom Lane

Date:

10 August 2007, 19:06:39

"Lonni J Friedman" <netllama@gmail.com> writes:
> select subtest,os,arch,info FROM ${table} WHERE (SELECT now() -
> interval '24 hours' < date_created::timestamp)='t' AND
> current_status='FAILED' ;

> My problem is that I can't think of a non-trivial way to adjust the
> SQL query so that it will only capture the last time the subtest ran,
> in the event that it ran multiple times during a 24 hour window.

If you don't mind using a nonstandard feature, DISTINCT ON will probably
help you solve this.  This example looks pretty close to being the same
as the "get the latest weather reports" example that you'll find in the
PG reference page for SELECT.

BTW, why are you writing the timestamp filter condition in such a
bizarre way?  I'd expect to see that query written as

select subtest,os,arch,info FROM ${table}
WHERE
 (now() - interval '24 hours') < date_created
 AND current_status='FAILED' ;

which would have a fighting chance of using an index on date_created.
The useless sub-SELECT not only eliminates any chance of using an index,
but incurs a fair amount of extra runtime overhead.

            regards, tom lane

Re: analyzing query results

From

Jon Sime

Date:

10 August 2007, 22:04:30

Lonni J Friedman wrote:
> The problem here is that occasionally the same subtest runs more than
> once in a 24 hour period (although not often), and not all of the test
> runs will have the same result (current_status).  To make matters more
> complicated when this happens, the additional runs will have a
> current_status of PASSED, yet my automated report still grabs the
> older FAILED test result, and this confuses the hell out of the people
> reviewing the report, since its technically passed already.

DDL and some sample data would be useful to help out, but if I
understood what you were describing, here's a simple mockup I tossed
into my local playground:

create table testresults (
    id              serial not null primary key,
    subtest         text not null,
    os              text not null,
    arch            text not null,
    info            text not null,
    current_status  text not null,
    date_created    date not null);

insert into testresults
    (subtest, os, arch, info, current_status, date_created)
    values
    ('a','linux','x86','info','PASSED',now()),
    ('b','linux','x86','info','PASSED',now()),
    ('c','linux','x86','info','FAILED',now()),
    ('c','linux','x86','info','FAILED','2007-07-01'),
    ('c','linux','x86','info','PASSED','2007-07-01'),
    ('d','linux','x86','info','FAILED',now()),
    ('d','linux','x86','info','PASSED',now()),
    ('e','linux','x86','info','FAILED','2007-07-01');

select a.*
from testresults a
    left join testresults b on (b.id <> a.id and
        b.date_created >= a.date_created and
        b.current_status = 'PASSED' and
        row(b.subtest, b.os, b.arch)
        = row(a.subtest, a.os, a.arch))
where a.current_status = 'FAILED' and b.id is null and
    a.date_created >= now() - interval '24 hours';

 id | subtest |  os   | arch | info | current_status | date_created
----+---------+-------+------+------+----------------+--------------
  8 | c       | linux | x86  | info | FAILED         | 2007-08-10
(1 row)

The row constructors within the join should be adjusted to match the set
of columns within the table that would adequately match two separate
rows as having been the same subtest on a different run (I just picked
the three that stood out from your original query as possibly being
identifying).

This may well not be the best way to do it. I've obviously only tested
it against a very tiny set of data. And without any DDL or sample data,
I may have misunderstood the problem.

-Jon

--
Senior Systems Developer
Media Matters for America
http://mediamatters.org/

Re: analyzing query results

From

"Lonni J Friedman"

Date:

13 August 2007, 16:50:26

On 8/10/07, Jon Sime <jsime@mediamatters.org> wrote:
> Lonni J Friedman wrote:
> > The problem here is that occasionally the same subtest runs more than
> > once in a 24 hour period (although not often), and not all of the test
> > runs will have the same result (current_status).  To make matters more
> > complicated when this happens, the additional runs will have a
> > current_status of PASSED, yet my automated report still grabs the
> > older FAILED test result, and this confuses the hell out of the people
> > reviewing the report, since its technically passed already.
>
> DDL and some sample data would be useful to help out, but if I
> understood what you were describing, here's a simple mockup I tossed
> into my local playground:
>
> create table testresults (
>     id              serial not null primary key,
>     subtest         text not null,
>     os              text not null,
>     arch            text not null,
>     info            text not null,
>     current_status  text not null,
>     date_created    date not null);
>
> insert into testresults
>     (subtest, os, arch, info, current_status, date_created)
>     values
>     ('a','linux','x86','info','PASSED',now()),
>     ('b','linux','x86','info','PASSED',now()),
>     ('c','linux','x86','info','FAILED',now()),
>     ('c','linux','x86','info','FAILED','2007-07-01'),
>     ('c','linux','x86','info','PASSED','2007-07-01'),
>     ('d','linux','x86','info','FAILED',now()),
>     ('d','linux','x86','info','PASSED',now()),
>     ('e','linux','x86','info','FAILED','2007-07-01');
>
> select a.*
> from testresults a
>     left join testresults b on (b.id <> a.id and
>         b.date_created >= a.date_created and
>         b.current_status = 'PASSED' and
>         row(b.subtest, b.os, b.arch)
>         = row(a.subtest, a.os, a.arch))
> where a.current_status = 'FAILED' and b.id is null and
>     a.date_created >= now() - interval '24 hours';
>
>  id | subtest |  os   | arch | info | current_status | date_created
> ----+---------+-------+------+------+----------------+--------------
>   8 | c       | linux | x86  | info | FAILED         | 2007-08-10
> (1 row)
>
> The row constructors within the join should be adjusted to match the set
> of columns within the table that would adequately match two separate
> rows as having been the same subtest on a different run (I just picked
> the three that stood out from your original query as possibly being
> identifying).
>
> This may well not be the best way to do it. I've obviously only tested
> it against a very tiny set of data. And without any DDL or sample data,
> I may have misunderstood the problem.

First, thanks much for your input.  I'm sorry that I didn't provide
sample data up front.

I'll go into more detail now.  In my database, I have one column whose
content is always the same, for each iteration of a subtest.  That
column is named 'cudacode'.  Thus, if a given subtest runs multiple
times in any 24 hour window, each inserted row will always have the
same cudacode for that subtest.  So cudacode is the unique identifier.
 That said, here's some example data that illustrates the problem i'm
having:

 id |        date_created        |
cudacode                                |             current_status

----+----------------------------+-----------------------------------------------------------------------+----------------------------------------
13 | 2007-08-13 11:00:18.782425 |
2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| FAILED
 14 | 2007-08-13 11:00:39.917609 |
2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| PASSED
 15 | 2007-08-13 11:46:09.770971 |
2007-05-13.cuda-linux64-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| FAILED

In the above data, id 13 & 14 are two iterations of the same subtest
(they have the same cudacode), yet the first has a current_status of
FAILED, and the second is PASSED.  What I want is a query that will
detect this and return nothing because the later row (per the
date_created field) has PASSED, since all I care about is when the
last iteration has FAILED.  id 15 has a different cudacode, and
therefore should be returned from the query as well.  In other words,
I'd like to see the content of row 15 returned, since its the only one
that meets my criteria.

Or as a starting point, just get rows 14 & 15 returned, since 14 & 15
are the newest unique instances of cudacode, and then I can check the
current_status in those as a secondary query. I tried to give that a
whirl below:

select a.* from cudaapps a left join cudaapps b on (b.cudacode <>
a.cudacode and b.date_created >= a.date_created and b.current_status =
'PASSED' and row(b.subtest, b.os, b.arch) = row(a.subtest, a.os,
a.arch)) where a.current_status = 'FAILED' and b.id is null and
a.date_created >= now() - interval '24 hours';

but for reasons that aren't clear to me, I'm getting id 13 & 15
returned instead of 14 & 15, and I can't figure out why.

Separately, I tried to use Tom's advice to call DISTINCT ON, with the
following query:

SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM
cudaapps WHERE (SELECT now() - interval '24 hours' <
date_created::timestamp)='t' ORDER BY date_created;

but that too is returning id 13, 14 & 15.  If I strip out the
date_created in the DISTINCT ON and ORDER BY sections:
SELECT DISTINCT ON(cudacode) id,current_status FROM cudaapps WHERE
(SELECT now() - interval '24 hours' < date_created::timestamp)='t' ;

then I get id 13 & 15 all over again, rather than 14 & 15.

I'm guessing that I'm missing something really obvious here, but I
can't quite figure out what.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       http://netllama.linux-sxs.org

Re: analyzing query results

From

"Lonni J Friedman"

Date:

13 August 2007, 17:02:20

On 8/10/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Lonni J Friedman" <netllama@gmail.com> writes:
> > select subtest,os,arch,info FROM ${table} WHERE (SELECT now() -
> > interval '24 hours' < date_created::timestamp)='t' AND
> > current_status='FAILED' ;
>
> > My problem is that I can't think of a non-trivial way to adjust the
> > SQL query so that it will only capture the last time the subtest ran,
> > in the event that it ran multiple times during a 24 hour window.
>
> If you don't mind using a nonstandard feature, DISTINCT ON will probably
> help you solve this.  This example looks pretty close to being the same
> as the "get the latest weather reports" example that you'll find in the
> PG reference page for SELECT.

Here's some example data that illustrates the problem i'm having:

 id |        date_created        |
cudacode                                |             current_status

----+----------------------------+-----------------------------------------------------------------------+----------------------------------------
13 | 2007-08-13 11:00:18.782425 |
2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| FAILED
 14 | 2007-08-13 11:00:39.917609 |
2007-05-13.cuda-linux32-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| PASSED
 15 | 2007-08-13 11:46:09.770971 |
2007-05-13.cuda-linux64-test0.CUBLAS.debug.hostEmu.CheckedIn_Compiler
| FAILED

In the above data, id 13 & 14 are two iterations of the same subtest
(they have the same cudacode), yet the first has a current_status of
FAILED, and the second is PASSED.  What I want is a query that will
detect this and return nothing because the later row (per the
date_created field) has PASSED, since all I care about is when the
last iteration has FAILED.  id 15 has a different cudacode, and
therefore should be returned from the query as well.  In other words,
I'd like to see the content of row 15 returned, since its the only one
that meets my criteria.

Or as a starting point, just get rows 14 & 15 returned, since 14 & 15
are the newest unique instances of cudacode, and then I can check the
current_status in those as a secondary query.

I tried to call DISTINCT ON, with the following query:

SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM
cudaapps WHERE (SELECT now() - interval '24 hours' <
date_created::timestamp)='t' ORDER BY date_created;

but that is returning id 13, 14 & 15.  If I strip out the date_created
in the DISTINCT ON and ORDER BY sections:

SELECT DISTINCT ON(cudacode) id,current_status FROM cudaapps WHERE
(SELECT now() - interval '24 hours' < date_created::timestamp)='t' ;

then I get id 13 & 15 all over again, rather than 14 & 15.

I'm guessing that I'm missing something really obvious here, but I
can't quite figure out what.

>
> BTW, why are you writing the timestamp filter condition in such a
> bizarre way?  I'd expect to see that query written as
>
> select subtest,os,arch,info FROM ${table}
> WHERE
>  (now() - interval '24 hours') < date_created
>  AND current_status='FAILED' ;
>
> which would have a fighting chance of using an index on date_created.
> The useless sub-SELECT not only eliminates any chance of using an index,
> but incurs a fair amount of extra runtime overhead.

I'm doing this because date_created is of type char(20) rather than a
normal/expected date datatype.  I know, its dumb.

Re: analyzing query results

From

Tom Lane

Date:

13 August 2007, 17:42:28

"Lonni J Friedman" <netllama@gmail.com> writes:
> I tried to call DISTINCT ON, with the following query:

> SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM
> cudaapps WHERE (SELECT now() - interval '24 hours' <
> date_created::timestamp)='t' ORDER BY date_created;

No, you're not grokking how to use DISTINCT ON.  The DISTINCT part
specifies what rows you want to group together --- here, all the ones
with the same cudacode --- and then ORDER BY has to list all those same
columns *followed by* the ordering condition that determines which row
you want out of the group.  What I think you want is

SELECT DISTINCT ON (cudacode) id,current_status
  FROM cudaapps
  WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t'
  ORDER BY cudacode, date_created DESC;

which gives you the latest current_status for each cudacode, and then
you filter the uninteresting rows in an outer select:

SELECT id
FROM
  (SELECT DISTINCT ON (cudacode) id,current_status
    FROM cudaapps
    WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t'
    ORDER BY cudacode, date_created DESC) ss
WHERE current_status = 'FAILED';

>> BTW, why are you writing the timestamp filter condition in such a
>> bizarre way?

> I'm doing this because date_created is of type char(20) rather than a
> normal/expected date datatype.  I know, its dumb.

Well, so you have to have the cast, but I'd still think that
    WHERE (now() - interval '24 hours') < date_created::timestamptz
would be the best way to express it.  The sub-SELECT and the comparison
to 't' are just obscurantism.

Another issue here is whether the date format was chosen to ensure that
textual sorting of the values would give the same result as datewise
sorting.  You might get some fairly bizarre misbehavior if the data is
sloppy about spaces instead of zero-fill, for instance.  Sometimes it's
worth biting the bullet and fixing the column type ...

            regards, tom lane

Re: analyzing query results

From

"Lonni J Friedman"

Date:

13 August 2007, 18:51:44

On 8/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Lonni J Friedman" <netllama@gmail.com> writes:
> > I tried to call DISTINCT ON, with the following query:
>
> > SELECT DISTINCT ON(date_created, cudacode) id,current_status FROM
> > cudaapps WHERE (SELECT now() - interval '24 hours' <
> > date_created::timestamp)='t' ORDER BY date_created;
>
> No, you're not grokking how to use DISTINCT ON.  The DISTINCT part
> specifies what rows you want to group together --- here, all the ones
> with the same cudacode --- and then ORDER BY has to list all those same
> columns *followed by* the ordering condition that determines which row
> you want out of the group.  What I think you want is
>
> SELECT DISTINCT ON (cudacode) id,current_status
>   FROM cudaapps
>   WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t'
>   ORDER BY cudacode, date_created DESC;
>
> which gives you the latest current_status for each cudacode, and then
> you filter the uninteresting rows in an outer select:
>
> SELECT id
> FROM
>   (SELECT DISTINCT ON (cudacode) id,current_status
>     FROM cudaapps
>     WHERE (SELECT now() - interval '24 hours' < date_created::timestamp)='t'
>     ORDER BY cudacode, date_created DESC) ss
> WHERE current_status = 'FAILED';

Thanks, this is exactly what I need.  Only one question.  What is the
'ss' that you have in that query?  I tried googling, but getting any
useful hits on just "postgres ss" returns a ton of useless results.

>
> >> BTW, why are you writing the timestamp filter condition in such a
> >> bizarre way?
>
> > I'm doing this because date_created is of type char(20) rather than a
> > normal/expected date datatype.  I know, its dumb.
>
> Well, so you have to have the cast, but I'd still think that
>         WHERE (now() - interval '24 hours') < date_created::timestamptz
> would be the best way to express it.  The sub-SELECT and the comparison
> to 't' are just obscurantism.

Thanks, that change is definitely equivalent.

>
> Another issue here is whether the date format was chosen to ensure that
> textual sorting of the values would give the same result as datewise
> sorting.  You might get some fairly bizarre misbehavior if the data is
> sloppy about spaces instead of zero-fill, for instance.  Sometimes it's
> worth biting the bullet and fixing the column type ...

Thankfully the date is always inserted via an automated script, so its
not going to get mangled or changed over time.  The issue that I was
trying to address was a 3rd party web app which does a horrible job of
formatting date fields.  I'm basically working around this apps bad
behavior by feeding it the date in the format that I need, rather than
letting the app mangle it.


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       http://netllama.linux-sxs.org

Re: analyzing query results

From

Tom Lane

Date:

13 August 2007, 19:28:44

"Lonni J Friedman" <netllama@gmail.com> writes:
> Thanks, this is exactly what I need.  Only one question.  What is the
> 'ss' that you have in that query?

Table alias for the sub-select.  The SQL spec says you have to provide
one ... you can leave out the AS keyword, though.

            regards, tom lane