Thread: Improve a query...

Improve a query...

From
"Eric G. Miller"
Date:
Looking for the best way to formulate a query to select
the most "recent" entry for an organization in a table
like:

CREATE TABLE reports (

  -- Report Id used to link up related 1:M rather than multi-key
  rpt_id  SERIAL NOT NULL PRIMARY KEY,

  -- A Unique ID for the organization
  org_id   char(10) NOT NULL CHECK(CHARACTER_LENGTH(op_id) = 10),

  -- The reporting period
  period   integer NOT NULL

  -- Various and Sundry ...
  .
  .
  .

  UNIQUE (org_id,period)
);

If max(period) for an organization yields the most recent reporting,
I want to get whatever is the most recent report for each organization.

This query works, but seems expensive...

SELECT a.rpt_id, a.org_id, a.period, ...
FROM reports As a
INNER JOIN
  (SELECT b.org_id, max(b.period) As period
   FROM reports b group by b.org_id) As c
ON a.org_id = c.org_id and a.period = c.period;

EXPLAIN looks thusly:

NOTICE:  QUERY PLAN:

Merge Join  (cost=147.98..164.48 rows=10 width=48)
  ->  Sort  (cost=69.83..69.83 rows=1000 width=32)
        ->  Seq Scan on reports a  (cost=0.00..20.00 rows=1000 width=32)
  ->  Sort  (cost=78.15..78.15 rows=100 width=16)
        ->  Subquery Scan c  (cost=69.83..74.83 rows=100 width=16)
              ->  Aggregate  (cost=69.83..74.83 rows=100 width=16)
                    ->  Group  (cost=69.83..72.33 rows=1000 width=16)
                          ->  Sort  (cost=69.83..69.83 rows=1000 width=16)
                                ->  Seq Scan on reports b (cost=0.00..20.00
                            rows=1000 width=16)


The data is very hierarchical so I didn't want to carry around alot of
key fields in related "many-sided" tables which may also have related
"many-sided" tables.  Any ideas on how to minimize the multiple
scans on the table?  The numbers for explain probably aren't telling
much since there's not much real data in the table at this time...

--
Eric G. Miller <egm2@jps.net>

Re: Improve a query...

From
Tom Lane
Date:
"Eric G. Miller" <egm2@jps.net> writes:
> Looking for the best way to formulate a query to select
> the most "recent" entry for an organization in a table
> like:

Take a look at the SELECT reference page's example for
SELECT DISTINCT ON:

: For example,
:
:         SELECT DISTINCT ON (location) location, time, report
:         FROM weatherReports
:         ORDER BY location, time DESC;
:
: retrieves the most recent weather report for each location.

A tad nonstandard, but bloody effective, particularly if you
have an appropriate index in place...

            regards, tom lane

Re: Improve a query...

From
"Eric G. Miller"
Date:
On Wed, May 02, 2001 at 01:17:44AM -0400, Tom Lane wrote:
> "Eric G. Miller" <egm2@jps.net> writes:
> > Looking for the best way to formulate a query to select
> > the most "recent" entry for an organization in a table
> > like:
>
> Take a look at the SELECT reference page's example for
> SELECT DISTINCT ON:
>
> : For example,
> :
> :         SELECT DISTINCT ON (location) location, time, report
> :         FROM weatherReports
> :         ORDER BY location, time DESC;
> :
> : retrieves the most recent weather report for each location.
>
> A tad nonstandard, but bloody effective, particularly if you
> have an appropriate index in place...

Tricky! After adding the unique 2 column index and using that construct
I get:

NOTICE:  QUERY PLAN:

Unique  (cost=1.20..1.22 rows=1 width=32)
  ->  Sort  (cost=1.20..1.20 rows=8 width=32)
          ->  Seq Scan on reports  (cost=0.00..1.08 rows=8 width=32)

EXPLAIN

--
Eric G. Miller <egm2@jps.net>

RE: Improve a query...

From
Jeff Eckermann
Date:
Try:
SELECT DISTINCT ON (org_id) rpt_id, org_id, period,...
FROM reports
ORDER BY period DESC;

> -----Original Message-----
> From:    Eric G. Miller [SMTP:egm2@jps.net]
> Sent:    Tuesday, May 01, 2001 4:04 AM
> To:    PostgreSQL General
> Subject:    [GENERAL] Improve a query...
>
> Looking for the best way to formulate a query to select
> the most "recent" entry for an organization in a table
> like:
>
> CREATE TABLE reports (
>
>   -- Report Id used to link up related 1:M rather than multi-key
>   rpt_id  SERIAL NOT NULL PRIMARY KEY,
>
>   -- A Unique ID for the organization
>   org_id   char(10) NOT NULL CHECK(CHARACTER_LENGTH(op_id) = 10),
>
>   -- The reporting period
>   period   integer NOT NULL
>
>   -- Various and Sundry ...
>   .
>   .
>   .
>
>   UNIQUE (org_id,period)
> );
>
> If max(period) for an organization yields the most recent reporting,
> I want to get whatever is the most recent report for each organization.
>
> This query works, but seems expensive...
>
> SELECT a.rpt_id, a.org_id, a.period, ...
> FROM reports As a
> INNER JOIN
>   (SELECT b.org_id, max(b.period) As period
>    FROM reports b group by b.org_id) As c
> ON a.org_id = c.org_id and a.period = c.period;
>
> EXPLAIN looks thusly:
>
> NOTICE:  QUERY PLAN:
>
> Merge Join  (cost=147.98..164.48 rows=10 width=48)
>   ->  Sort  (cost=69.83..69.83 rows=1000 width=32)
>         ->  Seq Scan on reports a  (cost=0.00..20.00 rows=1000 width=32)
>   ->  Sort  (cost=78.15..78.15 rows=100 width=16)
>         ->  Subquery Scan c  (cost=69.83..74.83 rows=100 width=16)
>               ->  Aggregate  (cost=69.83..74.83 rows=100 width=16)
>                     ->  Group  (cost=69.83..72.33 rows=1000 width=16)
>                           ->  Sort  (cost=69.83..69.83 rows=1000 width=16)
>                                 ->  Seq Scan on reports b
> (cost=0.00..20.00
>                             rows=1000 width=16)
>
>
> The data is very hierarchical so I didn't want to carry around alot of
> key fields in related "many-sided" tables which may also have related
> "many-sided" tables.  Any ideas on how to minimize the multiple
> scans on the table?  The numbers for explain probably aren't telling
> much since there's not much real data in the table at this time...
>
> --
> Eric G. Miller <egm2@jps.net>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: Improve a query...

From
"Eric G. Miller"
Date:
On Wed, May 02, 2001 at 07:30:13PM +0200, Peter Eisentraut wrote:
> Eric G. Miller writes:
>
> > If max(period) for an organization yields the most recent reporting,
> > I want to get whatever is the most recent report for each organization.
>
> SELECT org_id, max(period) FROM reports GROUP BY org_id;

Naw, doesn't work. I need the other columns, so the GROUP BY is out
(except in a subselect).  The DISTINCT ON () syntax, though
non-standard, seems to do the job.

--
Eric G. Miller <egm2@jps.net>

Re: Improve a query...

From
Peter Eisentraut
Date:
Eric G. Miller writes:

> If max(period) for an organization yields the most recent reporting,
> I want to get whatever is the most recent report for each organization.

SELECT org_id, max(period) FROM reports GROUP BY org_id;

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter