Thread: Limits in subqueries...

Limits in subqueries...

From
Martijn van Oosterhout
Date:
I have a (simplified) table looking like:

serial int4
custid int4
stamp  datetime

I want, for each customer, to get the serial of the the
highest timestamp for that customer, and for duplicate
timestamps, the highest serial. The way I thought how to
do that would be:

select custid, serial
from test b
where serial =
   (select serial
    from test a
    where a.custid = b.custid
    order by serial desc limit 1);

but thats not allowed. The best I could do is:

select custid, max(serial)
from test
where (custid,stamp) in
   (select custid, max(stamp)
    from test
    group by custid)
group by custid;

Which is ugly and only works in this case because
max does what I want. It has a horrible plan though:

Aggregate  (cost=1.36 rows=11 width=8)
  ->  Group  (cost=1.36 rows=11 width=8)
        ->  Sort  (cost=1.36 rows=11 width=8)
              ->  Seq Scan on test  (cost=1.36 rows=11 width=8)
                    SubPlan
                      ->  Aggregate  (cost=1.36 rows=11 width=12)
                            ->  Group  (cost=1.36 rows=11 width=12)
                                  ->  Sort  (cost=1.36 rows=11 width=12)
                                        ->  Seq Scan on test  (cost=1.36
rows=11 width=12)

In the more general case, this won't work.

Is there a better way of doing this? Does postgres 7.0 do
better in this case?
--
Martijn van Oosterhout <kleptog@cupid.suninternet.com>
http://cupid.suninternet.com/~kleptog/


Re: Limits in subqueries...

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@cupid.suninternet.com> writes:
> I want, for each customer, to get the serial of the the
> highest timestamp for that customer, and for duplicate
> timestamps, the highest serial.

If you will take a non-SQL-standard solution, this will work:

select distinct on (custid) custid, serial from test
order by custid, stamp desc, serial desc;

(In pre-7.0, omit parens around distinct-on value.)  Basically this
works by ordering the data in the right way and then taking just the
first row out of each group with the same custid.

It's a lot harder if you want to be SQL-compliant ...

> The way I thought how to do that would be:

> select custid, serial
> from test b
> where serial =
>    (select serial
>     from test a
>     where a.custid = b.custid
>     order by serial desc limit 1);

> but thats not allowed.

Nope.  Subselects can't have order-by clauses; SQL thinks you shouldn't
do anything as grotty as depending on the order in which tuples are
processed.  The spec lets you sort the final output of a SELECT, but not
any intermediate steps.

> The best I could do is:

> select custid, max(serial)
> from test
> where (custid,stamp) in
>    (select custid, max(stamp)
>     from test
>     group by custid)
> group by custid;

> Which is ugly and only works in this case because
> max does what I want. It has a horrible plan though:

> Aggregate  (cost=1.36 rows=11 width=8)
>   ->  Group  (cost=1.36 rows=11 width=8)
>         ->  Sort  (cost=1.36 rows=11 width=8)
>               ->  Seq Scan on test  (cost=1.36 rows=11 width=8)
>                     SubPlan
>                       ->  Aggregate  (cost=1.36 rows=11 width=12)
>                             ->  Group  (cost=1.36 rows=11 width=12)
>                                   ->  Sort  (cost=1.36 rows=11 width=12)
>                                         ->  Seq Scan on test  (cost=1.36 rows=11 width=12)

> In the more general case, this won't work.

It'd work as well as your other solution, seems to me, since max()
should be able to do the same calculation that "order by" would.
What's the more general case you are visualizing?

> Is there a better way of doing this? Does postgres 7.0 do
> better in this case?

7.0 is smart enough to materialize the result of the subselect (ie,
do the subselect just once and cache the results).  That'd make
a pretty considerable difference in runtime, but it's still a lot
slower than the DISTINCT ON solution.

            regards, tom lane