Thread: Optimising a query requiring seqscans=0

Optimising a query requiring seqscans=0

From: Russ Brown
Date: Sep 14, 2006, 11:15 AM
Hi,

We recently upgraded our trac backend from sqlite to postgres, and I
decided to have a little fun and write some reports that delve into
trac's subversion cache, and got stuck with a query optimisation
problem.

Table revision contains 2800+ rows.
Table node_change contains 370,000+ rows.

rev is a TEXT field on both tables, containing numerical data (not my
choice; trac does it like this to support VCS backends with
non-numerical revision identifiers).

I've got stuck with this query:

   SELECT author, COUNT(DISTINCT r.rev)
     FROM revision AS r
LEFT JOIN node_change AS nc
       ON r.rev = nc.rev
    WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30 days'))::integer
 GROUP BY r.author;

Currently it shows the number of commits per author in the last 30 days.
The join to node_change is superfluous for this purpose, but was added
because I intended to add new columns that draw on this table, such as
the number of files added, files deleted, and so on.
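Something like the following is what I had in mind (the change_type
column name and the 'A'/'D' values are guesses at trac's node_change
schema, so treat it as a sketch):

   -- per-author commit count plus per-change-type file counts
   SELECT r.author,
          COUNT(DISTINCT r.rev) AS commits,
          -- change_type and 'A'/'D' are assumptions about the schema
          COUNT(CASE WHEN nc.change_type = 'A' THEN 1 END) AS files_added,
          COUNT(CASE WHEN nc.change_type = 'D' THEN 1 END) AS files_deleted
     FROM revision AS r
LEFT JOIN node_change AS nc
       ON r.rev = nc.rev
    WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30 days'))::integer
 GROUP BY r.author;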

I never got that far however due to the planner problem:

 GroupAggregate  (cost=56755.41..56852.28 rows=2 width=17) (actual time=4836.433..4897.458 rows=25 loops=1)
   ->  Sort  (cost=56755.41..56787.69 rows=12913 width=17) (actual time=4836.233..4851.968 rows=22419 loops=1)
         Sort Key: r.author
         ->  Merge Left Join  (cost=53886.10..55873.68 rows=12913 width=17) (actual time=4600.733..4641.749 rows=22419 loops=1)
               Merge Cond: ("outer".rev = "inner".rev)
               ->  Sort  (cost=93.78..96.24 rows=982 width=17) (actual time=7.050..7.383 rows=1088 loops=1)
                     Sort Key: r.rev
                     ->  Index Scan using revision_time_idx on revision r  (cost=0.01..44.98 rows=982 width=17) (actual time=0.191..4.014 rows=1088 loops=1)
                           Index Cond: ("time" >= (date_part('epoch'::text, (now() - '30 days'::interval)))::integer)
               ->  Sort  (cost=53792.32..54719.09 rows=370707 width=8) (actual time=4203.665..4443.748 rows=346238 loops=1)
                     Sort Key: nc.rev
                     ->  Seq Scan on node_change nc  (cost=0.00..12852.07 rows=370707 width=8) (actual time=0.054..663.719 rows=370707 loops=1)
 Total runtime: 4911.430 ms
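
For what it's worth, disabling sequential scans for the session (purely
for testing, not something to leave on in production) is done with:

    SET enable_seqscan = off;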

If I disable sequential scans I get the following explain:

 GroupAggregate  (cost=221145.13..221242.01 rows=2 width=17) (actual time=286.348..348.268 rows=25 loops=1)
   ->  Sort  (cost=221145.13..221177.42 rows=12913 width=17) (actual time=286.183..302.239 rows=22419 loops=1)
         Sort Key: r.author
         ->  Nested Loop Left Join  (cost=0.01..220263.40 rows=12913 width=17) (actual time=0.339..86.626 rows=22419 loops=1)
               ->  Index Scan using revision_time_idx on revision r  (cost=0.01..44.98 rows=982 width=17) (actual time=0.194..4.056 rows=1088 loops=1)
                     Index Cond: ("time" >= (date_part('epoch'::text, (now() - '30 days'::interval)))::integer)
               ->  Index Scan using node_change_rev_idx on node_change nc  (cost=0.00..223.18 rows=86 width=8) (actual time=0.009..0.058 rows=21 loops=1088)
                     Index Cond: ("outer".rev = nc.rev)
 Total runtime: 350.103 ms

Statistics are set to 20, and I have ANALYZEd both tables.

The report itself isn't important, but I'm using this as an exercise in
PostgreSQL query optimisation and planner tuning, so any help/hints
would be appreciated.

Thanks.

--

Russ


Re: Optimising a query requiring seqscans=0

From: Jim Nasby
Date: 2006-09-21 23:39 -0400
On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:
> We recently upgraded our trac backend from sqlite to postgres, and I
> decided to have a little fun and write some reports that delve into
> trac's subversion cache, and got stuck with a query optimisation
> problem.
>
> Table revision contains 2800+ rows
> Table node_change contains 370000+.
<...>
> I've got stuck with this query:
>
>    SELECT author, COUNT(DISTINCT r.rev)
>      FROM revision AS r
> LEFT JOIN node_change AS nc
>        ON r.rev=nc.rev
>     WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> days'))::integer

Man I really hate when people store time_t in a database...

>  GROUP BY r.author;
>
> Statistics are set to 20, and I have ANALYZEd both tables.
>
> The report itself isn't important, but I'm using this as an
> exercise in
> PostgreSQL query optimisation and planner tuning, so any help/hints
> would be appreciated.

Setting statistics higher (100-200), at least for the large table, will
likely help. Also make sure that you've set effective_cache_size
correctly (I generally set it to total memory minus 1G, assuming the
server has at least 4G in it).
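
For example, raising the per-column target on the big table's join
column and re-analyzing would look something like this (the column
choice and target value are just illustrative):

    -- give the planner more histogram/MCV detail for nc.rev
    ALTER TABLE node_change ALTER COLUMN rev SET STATISTICS 200;
    ANALYZE node_change;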
--
Jim Nasby                                    jimn@enterprisedb.com
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Optimising a query requiring seqscans=0

From: Russ Brown
Date:
On Thu, 2006-09-21 at 23:39 -0400, Jim Nasby wrote:
> On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:
> > We recently upgraded our trac backend from sqlite to postgres, and I
> > decided to have a little fun and write some reports that delve into
> > trac's subversion cache, and got stuck with a query optimisation
> > problem.
> >
> > Table revision contains 2800+ rows
> > Table node_change contains 370000+.
> <...>
> > I've got stuck with this query:
> >
> >    SELECT author, COUNT(DISTINCT r.rev)
> >      FROM revision AS r
> > LEFT JOIN node_change AS nc
> >        ON r.rev=nc.rev
> >     WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> > days'))::integer
>
> Man I really hate when people store time_t in a database...
>

I know. Probably something to do with database engine independence. I
don't know if sqlite even has a date type (probably does, but I haven't
checked).

> >  GROUP BY r.author;
> >
> > Statistics are set to 20, and I have ANALYZEd both tables.
> >
> > The report itself isn't important, but I'm using this as an
> > exercise in
> > PostgreSQL query optimisation and planner tuning, so any help/hints
> > would be appreciated.
>
> Setting statistics higher (100-200), at least for the large table
> will likely help. Also make sure that you've set effective_cache_size
> correctly (I generally set it to total memory - 1G, assuming the
> server has at least 4G in it).

Thank you: the problem was the effective_cache_size (which I hadn't
changed from the default of 1000). This machine doesn't have loads of
RAM, but I knocked it up to 65536 and now the query uses the index,
without having to change the statistics.
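
For reference, that's a one-line change in postgresql.conf (on a server
of this vintage the value is in 8kB pages, so 65536 works out to about
512MB):

    # postgresql.conf
    effective_cache_size = 65536    # 65536 * 8kB ~= 512MB assumed cached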

Thanks a lot!

> --
> Jim Nasby                                    jimn@enterprisedb.com
> EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)