Re: efficient storing of urls - Mailing list pgsql-general

From Shane Wegner
Subject Re: efficient storing of urls
Date
Msg-id 20040301192345.GA23411@cm.nu
In response to Re: efficient storing of urls  ("scott.marlowe" <scott.marlowe@ihs.com>)
Responses Re: efficient storing of urls
List pgsql-general
On Mon, Mar 01, 2004 at 08:54:43AM -0700, scott.marlowe wrote:
> On Fri, 27 Feb 2004, Shane Wegner wrote:
> > A select to display the 100 most popular pages:
> > explain analyze select count(*) as c,url from hits,urls where hit_date between '2004-01-01' and '2004-01-31' and url_id=urls.id group by url order by c desc limit 100;
> >                                                                       QUERY PLAN
> > ------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
> >    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
> >          Sort Key: count(*)
> >          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
> >                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
> >                      Sort Key: urls.url
> >                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
> >                            Merge Cond: ("outer".url_id = "inner".id)
> >                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
> >                                  Sort Key: hits.url_id
> >                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
> >                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
> >                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
> >                                  Sort Key: urls.id
> >                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
> >  Total runtime: 34221.250 ms
> > (16 rows)
>
> Your single biggest jump in actual time seems to be coming from the seq
> scan on hits, which takes you from 40 ms to 12,000 ms.
>
> Can you index the hit_date and see if that helps?

Hmm, I don't think it's using the index.  I tried two indexes:

create index hit_date on hits(hit_date);
create index hit_date2 on hits(hit_date,url_id);
vacuum analyze;

with the hope that it would use the multicolumn index for
hit_date and url_id, but it's still doing a seq scan on
hits.
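One way to check whether the planner is even able to use the new indexes (a diagnostic sketch only, not a fix; it just makes any seq-scan plan look artificially expensive so an index plan wins if one exists) is to disable sequential scans for the session and compare the resulting plan and cost:

```sql
BEGIN;
-- Diagnostic: discourage seq scans so the planner picks an
-- index scan if the index is usable for this query at all.
SET LOCAL enable_seqscan = off;
EXPLAIN ANALYZE
SELECT count(*) AS c, url
FROM hits, urls
WHERE hit_date BETWEEN '2004-01-01' AND '2004-01-31'
  AND url_id = urls.id
GROUP BY url
ORDER BY c DESC
LIMIT 100;
ROLLBACK;  -- SET LOCAL reverts at transaction end anyway
```

If the forced index-scan plan is actually slower, the planner was right to prefer the seq scan.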

explain analyze select count(*) as c,url from hits,urls where hit_date between
'2004-01-01' and '2004-01-31' and url_id=urls.id group by url order by c desc
limit 100;
                                                                           QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=300458.08..300458.33 rows=100 width=68) (actual time=21117.590..21117.842 rows=100 loops=1)
   ->  Sort  (cost=300458.08..301021.65 rows=225429 width=68) (actual time=21117.579..21117.718 rows=100 loops=1)
         Sort Key: count(*)
         ->  GroupAggregate  (cost=256840.55..259094.84 rows=225429 width=68) (actual time=19645.926..20671.608 rows=53460 loops=1)
               ->  Sort  (cost=256840.55..257404.12 rows=225429 width=68) (actual time=19645.841..20000.521 rows=248888 loops=1)
                     Sort Key: urls.url
                     ->  Merge Join  (cost=163951.08..215477.30 rows=225429 width=68) (actual time=10623.308..13332.906 rows=248888 loops=1)
                           Merge Cond: ("outer".id = "inner".url_id)
                           ->  Index Scan using urls_pkey on urls  (cost=0.00..47329.44 rows=326181 width=72) (actual time=0.109..1063.677 rows=326181 loops=1)
                           ->  Sort  (cost=163951.08..164519.00 rows=227170 width=4) (actual time=10622.954..10966.062 rows=248889 loops=1)
                                 Sort Key: hits.url_id
                                 ->  Seq Scan on hits  (cost=0.00..141717.33 rows=227170 width=4) (actual time=0.282..8229.634 rows=249350 loops=1)
                                       Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
 Total runtime: 21160.919 ms
(14 rows)
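For what it's worth, with roughly 250,000 rows passing the date filter the planner may simply be judging a full scan cheaper than a quarter-million random index probes, and it may be right. Two knobs that can tip that decision, shown as a sketch (the values are illustrative assumptions, not tuned recommendations):

```sql
-- Tell the planner random page reads are cheaper than the
-- default estimate of 4 (reasonable if the data is cached).
SET random_page_cost = 2;

-- Or physically reorder the table by hit_date so a date-range
-- index scan touches far fewer pages.  Note: CLUSTER takes an
-- exclusive lock on hits while it runs.
CLUSTER hit_date ON hits;
ANALYZE hits;
```

After either change, re-running EXPLAIN ANALYZE shows whether the index scan plan now wins on estimated cost.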

S
