Thread: Join optimisation Quandary

Join optimisation Quandary

From
Ceri Storey
Date:
Hi there.

I've got a database (the schema is:
http://compsoc.man.ac.uk/~cez/2004/01/14/tv-schema.sql) for television
data. Now, one of the things I want to use this for is a now and next
display. (much like http://teletext.com/tvplus/nownext.asp ).

I've got a view defined like this:
CREATE VIEW progtit AS SELECT programme.*, title_seen, title_wanted, title_text FROM (programme NATURAL JOIN title);

And to select the programmes that are on currently and next, I'm doing
something like this:

SELECT *
FROM progtit AS p1 LEFT JOIN progtit AS p2 ON p1.prog_next = p2.prog_id
WHERE prog_start <= '2004-01-14 23:09:11'
    AND prog_stop > '2004-01-14 23:09:11';

Now, unfortunately this runs rather slowly (it takes around 1 second to
complete on my machine), as it (AFAIK) ends up building a complete
instance of the progtit view and then joining the current programmes
with that, instead of just finding the current set of programmes and
then selecting the relevant rows from the view.

Now, I know I could just do it in two separate passes, one for the current
programmes and one for those after them, but it'd be neater to do it in one.
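(For concreteness, the two-pass version I mean would be roughly the
following; just a sketch against the progtit view above, with the second
query driven by the prog_next values collected from the first:)

```sql
-- Pass 1: the programmes on now
SELECT * FROM progtit
WHERE prog_start <= '2004-01-14 23:09:11'
  AND prog_stop > '2004-01-14 23:09:11';

-- Pass 2: their successors, using the prog_next values from pass 1
SELECT * FROM progtit
WHERE prog_id IN (/* prog_next values from pass 1 */);
```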

So, is there any way to optimize the above query? Any clues, or
references would be wonderful.

Thanks.
--
Ceri Storey <cez@necrofish.org.uk>

Re: Join optimisation Quandary

From
Stephan Szabo
Date:
On Wed, 14 Jan 2004, Ceri Storey wrote:

> Hi there.
>
> I've got a database (the schema is:
> http://compsoc.man.ac.uk/~cez/2004/01/14/tv-schema.sql) for television
> data. Now, one of the things I want to use this for is a now and next
> display. (much like http://teletext.com/tvplus/nownext.asp ).
>
> I've got a view defined like this:
> CREATE VIEW progtit AS SELECT programme.*, title_seen, title_wanted,
> title_text FROM (programme NATURAL JOIN title);
>
> And to select the programmes that are on currently and next, I'm doing
> something like this:
>
> SELECT *
> FROM progtit AS p1 LEFT JOIN progtit AS p2 ON p1.prog_next = p2.prog_id
> WHERE prog_start <= '2004-01-14 23:09:11'
>     AND prog_stop > '2004-01-14 23:09:11';
>
> Now, unfortunately this runs rather slowly (it takes around 1 second to
> complete on my machine), as it (AFAIK) ends up building a complete
> instance of the progtit view and then joining the current programmes
> with that, instead of just finding the current set of programmes and
> then selecting the relevant rows from the view.
>
> Now, I know I could just do it in two separate passes, one for the current
> programmes and one for those after them, but it'd be neater to do it in one.
>
> So, is there any way to optimize the above query? Any clues, or
> references would be wonderful.

As a starting point, we're likely to need the exact query, explain analyze
output for the query and version information.

Re: Join optimisation Quandary

From
Stephan Szabo
Date:
On Sat, 17 Jan 2004, Ceri Storey wrote:

> On Fri, Jan 16, 2004 at 10:17:50AM -0800, Stephan Szabo wrote:
> > As a starting point, we're likely to need the exact query, explain analyze
> > output for the query and version information.
>
> Okay, from top to bottom:
>
>   SELECT p1.chan_name, p1.prog_start AS now_start, p1.prog_id, p1.title_text,
>     p2.prog_start AS next_start, p2.prog_id, p2.title_text,
>     p1.title_wanted, p2.title_wanted, p1.chan_id
>   FROM (programme natural join channel NATURAL JOIN title) AS p1
>     LEFT OUTER JOIN (programme NATURAL JOIN title) AS p2
>     ON p1.prog_next = p2.prog_id
>   WHERE p1.prog_start <= timestamp 'now' AND p1.prog_stop > timestamp 'now'
>   ORDER BY p1.chan_id ASC
>

Well, the plan would seem reasonable to me if there really were only 1 row
coming from the where conditions on p1.  As a first step, if you raise the
statistics target (see ALTER TABLE) for prog_start and prog_stop and
re-analyze the table, do you get a better estimate of the rows from that
condition? Secondly, what do you get if you set enable_nestloop=off before
explain analyzing the query?
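Untested, but something along these lines; the target of 100 is just an
example value, pick whatever gives sane estimates:

```sql
-- raise the per-column statistics targets and regather stats
ALTER TABLE programme ALTER COLUMN prog_start SET STATISTICS 100;
ALTER TABLE programme ALTER COLUMN prog_stop SET STATISTICS 100;
ANALYZE programme;

-- then, in the same session, disable nested loops and re-run
SET enable_nestloop = off;
-- EXPLAIN ANALYZE the query from your mail again
```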

Re: Join optimisation Quandary

From
Ceri Storey
Date:
On Sat, Jan 17, 2004 at 01:03:34AM +0000, Ceri Storey wrote:
> Okay, from top to bottom:
>
>   SELECT p1.chan_name, p1.prog_start AS now_start, p1.prog_id, p1.title_text,
>     p2.prog_start AS next_start, p2.prog_id, p2.title_text,
>     p1.title_wanted, p2.title_wanted, p1.chan_id
>   FROM (programme natural join channel NATURAL JOIN title) AS p1
>     LEFT OUTER JOIN (programme NATURAL JOIN title) AS p2
>     ON p1.prog_next = p2.prog_id
>   WHERE p1.prog_start <= timestamp 'now' AND p1.prog_stop > timestamp 'now'
>   ORDER BY p1.chan_id ASC


Although, as I've just found, another bottleneck is the title table:
PostgreSQL seems to insist on doing a Seq Scan of the entire table.

Compare:
tv=> explain analyse SELECT * FROM tid LEFT OUTER JOIN title ON t1 = title_id OR t2 = title_id;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=190.83..267285.83 rows=2000 width=35) (actual time=222.776..2430.073 rows=33 loops=1)
   Join Filter: (("outer".t1 = "inner".title_id) OR ("outer".t2 = "inner".title_id))
   ->  Seq Scan on tid  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.028..10.457 rows=17 loops=1)
   ->  Materialize  (cost=190.83..297.66 rows=10683 width=27) (actual time=0.197..57.918 rows=10767 loops=17)
         ->  Seq Scan on title  (cost=0.00..190.83 rows=10683 width=27) (actual time=0.045..64.988 rows=10767 loops=1)
 Total runtime: 2435.059 ms
(6 rows)

With:
tv=> explain analyse select * from title where title_id IN (SELECT t1 FROM tid UNION SELECT t2 FROM tid);
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=205.16..451.40 rows=200 width=27) (actual time=3.065..82.689 rows=33 loops=1)
   Hash Cond: ("outer".title_id = "inner".t1)
   ->  Seq Scan on title  (cost=0.00..190.83 rows=10683 width=27) (actual time=0.010..36.325 rows=10767 loops=1)
   ->  Hash  (cost=204.66..204.66 rows=200 width=4) (actual time=1.464..1.464 rows=0 loops=1)
         ->  HashAggregate  (cost=204.66..204.66 rows=200 width=4) (actual time=1.234..1.355 rows=33 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=169.66..199.66 rows=2000 width=4) (actual time=0.735..1.104 rows=33 loops=1)
                     ->  Unique  (cost=169.66..179.66 rows=2000 width=4) (actual time=0.728..0.934 rows=33 loops=1)
                           ->  Sort  (cost=169.66..174.66 rows=2000 width=4) (actual time=0.722..0.779 rows=34 loops=1)
                                 Sort Key: t1
                                 ->  Append  (cost=0.00..60.00 rows=2000 width=4) (actual time=0.054..0.534 rows=34 loops=1)
                                       ->  Subquery Scan "*SELECT* 1"  (cost=0.00..30.00 rows=1000 width=4) (actual time=0.050..0.228 rows=17 loops=1)
                                             ->  Seq Scan on tid  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.041..0.126 rows=17 loops=1)
                                       ->  Subquery Scan "*SELECT* 2"  (cost=0.00..30.00 rows=1000 width=4) (actual time=0.014..0.183 rows=17 loops=1)
                                             ->  Seq Scan on tid  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.008..0.087 rows=17 loops=1)
 Total runtime: 83.214 ms
(15 rows)

--
Ceri Storey <cez@necrofish.org.uk>

Re: Join optimisation Quandary

From
Ceri Storey
Date:
On Fri, Jan 16, 2004 at 10:05:54PM -0800, Stephan Szabo wrote:
> Well, the plan would seem reasonable to me if there really were only 1 row
> coming from the where conditions on p1.  As a first step, if you raise the
> statistics target (see ALTER TABLE) for prog_start and prog_stop and
> re-analyze the table, do you get a better estimate of the rows from that
> condition?

Indeed, that helps a fair bit with the estimate.

> Secondly, what do you get if you set enable_nestloop=off before
> explain analyzing the query?
See attachment.

That said, I've managed to halve the query time by lifting the join of
the title table out of the RHS of the left outer join into the top level
of the FROM clause, which was really the kind of advice I was after. If
this is the wrong list for that kind of thing, please say so.
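For the record, the lifted version looks roughly like this (sketched from
memory, so treat it as illustrative; the title join for p2 now happens at
the top level rather than inside the RHS of the outer join):

```sql
SELECT p1.chan_name, p1.prog_start AS now_start, p1.prog_id, p1.title_text,
  p2.prog_start AS next_start, p2.prog_id, t2.title_text,
  p1.title_wanted, t2.title_wanted, p1.chan_id
FROM (programme NATURAL JOIN channel NATURAL JOIN title) AS p1
  LEFT OUTER JOIN programme AS p2 ON p1.prog_next = p2.prog_id
  LEFT OUTER JOIN title AS t2 ON p2.title_id = t2.title_id
WHERE p1.prog_start <= timestamp 'now' AND p1.prog_stop > timestamp 'now'
ORDER BY p1.chan_id ASC;
```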

Again, thanks.
--
Ceri Storey <cez@necrofish.org.uk>

Attachment

Re: Join optimisation Quandary

From
Ceri Storey
Date:
On Fri, Jan 16, 2004 at 10:17:50AM -0800, Stephan Szabo wrote:
> As a starting point, we're likely to need the exact query, explain analyze
> output for the query and version information.

Okay, from top to bottom:

  SELECT p1.chan_name, p1.prog_start AS now_start, p1.prog_id, p1.title_text,
    p2.prog_start AS next_start, p2.prog_id, p2.title_text,
    p1.title_wanted, p2.title_wanted, p1.chan_id
  FROM (programme natural join channel NATURAL JOIN title) AS p1
    LEFT OUTER JOIN (programme NATURAL JOIN title) AS p2
    ON p1.prog_next = p2.prog_id
  WHERE p1.prog_start <= timestamp 'now' AND p1.prog_stop > timestamp 'now'
  ORDER BY p1.chan_id ASC

QUERY PLAN
----
 Sort  (cost=983.38..983.38 rows=1 width=85) (actual time=10988.525..10988.557 rows=17 loops=1)
   Sort Key: public.programme.chan_id
   ->  Nested Loop Left Join  (cost=289.86..983.37 rows=1 width=85) (actual time=631.918..10988.127 rows=17 loops=1)
         Join Filter: ("outer".prog_next = "inner".prog_id)
         ->  Nested Loop  (cost=4.33..9.12 rows=1 width=55) (actual time=4.111..7.960 rows=17 loops=1)
               ->  Hash Join  (cost=4.33..5.64 rows=1 width=37) (actual time=4.017..5.182 rows=17 loops=1)
                     Hash Cond: ("outer".chan_id = "inner".chan_id)
                     ->  Seq Scan on channel  (cost=0.00..1.20 rows=20 width=17) (actual time=0.017..0.403 rows=20 loops=1)
                     ->  Hash  (cost=4.32..4.32 rows=1 width=24) (actual time=3.910..3.910 rows=0 loops=1)
                           ->  Index Scan using prog_stop_idx on programme  (cost=0.00..4.32 rows=1 width=24) (actual time=0.140..3.809 rows=17 loops=1)
                                 Index Cond: (prog_stop > '2004-01-17 01:01:51.786145'::timestamp without time zone)
                                 Filter: (prog_start <= '2004-01-17 01:01:51.786145'::timestamp without time zone)
               ->  Index Scan using "$3" on title  (cost=0.00..3.47 rows=1 width=26) (actual time=0.078..0.114 rows=1 loops=17)
                     Index Cond: ("outer".title_id = title.title_id)
         ->  Hash Join  (cost=285.54..892.91 rows=6507 width=34) (actual time=191.612..586.407 rows=7145 loops=17)
               Hash Cond: ("outer".title_id = "inner".title_id)
               ->  Seq Scan on programme  (cost=0.00..121.07 rows=6507 width=16) (actual time=0.036..42.337 rows=7145 loops=17)
               ->  Hash  (cost=190.83..190.83 rows=10683 width=26) (actual time=190.795..190.795 rows=0 loops=17)
                     ->  Seq Scan on title  (cost=0.00..190.83 rows=10683 width=26) (actual time=0.143..113.223 rows=10715 loops=17)
 Total runtime: 10989.661 ms

And both client and server are:
postgres (PostgreSQL) 7.4.1

Thanks for looking into it.
--
Ceri Storey <cez@necrofish.org.uk>

Re: Join optimisation Quandary

From
Tom Lane
Date:
Ceri Storey <cez-misc.pgsql-perf@necrofish.org.uk> writes:
> Although, as I've just found, another bottleneck is the title table:
> PostgreSQL seems to insist on doing a Seq Scan of the entire table.

>    ->  Seq Scan on tid  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.028..10.457 rows=17 loops=1)

It doesn't look like you've ever vacuumed or analyzed "tid" --- those
are the default cost and rows estimates.  Although I'm unsure whether
the plan would change much if you had.
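Something like this (untested) would at least give the planner real
statistics to work with:

```sql
VACUUM ANALYZE tid;

EXPLAIN ANALYZE
SELECT * FROM tid LEFT OUTER JOIN title
  ON t1 = title_id OR t2 = title_id;
```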

            regards, tom lane