Re: efficient storing of urls - Mailing list pgsql-general
From: Bill Moran
Subject: Re: efficient storing of urls
Date:
Msg-id: 403FF862.60202@potentialtech.com
In response to: Re: efficient storing of urls (Shane Wegner <shane-keyword-pgsql.a1e0d9@cm.nu>)
List: pgsql-general
Shane Wegner wrote:
> On Fri, Feb 27, 2004 at 06:00:36PM -0500, Sean Shanny wrote:
>
>> Shane,
>>
>> Can you give an example of a query that has gotten slower due to the
>> increasing size of the urls table with an explain analyze?
>
> The database is a simple traffic monitoring tool, so we have a hits
> table which gets a new row for every url accessed.  Very simple table:
>
> create table hits(
>     hit_date date not null,
>     hit_time time(0) without time zone not null,
>     url_id int references urls(id)
> );
>
> A select to display the 100 most popular pages:
>
> explain analyze select count(*) as c, url from hits, urls
>   where hit_date between '2004-01-01' and '2004-01-31'
>   and url_id = urls.id group by url order by c desc limit 100;
>
>                                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
>    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
>          Sort Key: count(*)
>          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
>                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
>                      Sort Key: urls.url
>                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
>                            Merge Cond: ("outer".url_id = "inner".id)
>                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
>                                  Sort Key: hits.url_id
>                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
>                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
>                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
>                                  Sort Key: urls.id
>                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
>  Total runtime: 34221.250 ms
> (16 rows)
>
> Time: 34224.959 ms

While there's a lot that can be tried to improve performance here, it
doesn't look like changing the urls table is going to help much.  Notice
that the sequential scan on the urls table takes a tiny amount of time
(compared to everything else).

You might do well to create an index on hit_date, as that sequential scan
on hits seems to take quite a while.

Also, it doesn't seem like this query is doing what you want at all.  It
says you only get 16 rows; are there only 16 URLs?

However, I would think the best thing you could do, if you're often
counting up hits, is to count them as they occur.  Add an INT column to
urls (call it "hits" or something), and each time you store a hit record
in the hits table, also do the following query:

UPDATE urls SET hits = hits + 1 WHERE url = '<the URL string>';

This will speed up and simplify the above query greatly.  You could even
simplify your client app by making the above UPDATE an insert trigger on
the hits table.

--
Bill Moran
Potential Technologies
http://www.potentialtech.com
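For anyone following along, here is a rough sketch of the two suggestions
above: an index on hit_date, and a counter column on urls kept current by
an insert trigger on hits.  The column, index, function, and trigger names
are made up for illustration, plpgsql is assumed to be installed, and the
trigger matches on urls.id via NEW.url_id (rather than on the URL string)
since hits only stores the id.  Adapt as needed:

    -- index so the date-range filter can avoid a seq scan on hits
    CREATE INDEX hits_hit_date_idx ON hits (hit_date);

    -- counter column on urls; added in separate steps for older releases
    ALTER TABLE urls ADD COLUMN hits integer;
    UPDATE urls SET hits = 0;
    ALTER TABLE urls ALTER COLUMN hits SET DEFAULT 0;

    -- trigger function: bump the counter for the url just hit
    CREATE OR REPLACE FUNCTION bump_url_hits() RETURNS trigger AS '
    BEGIN
        UPDATE urls SET hits = hits + 1 WHERE id = NEW.url_id;
        RETURN NEW;
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER hits_insert_count
        AFTER INSERT ON hits
        FOR EACH ROW EXECUTE PROCEDURE bump_url_hits();

With the counter maintained this way, an all-time top-100 report reduces to
a plain "SELECT url, hits FROM urls ORDER BY hits DESC LIMIT 100;", though
per-month counts would still need the hits table (or a per-month summary).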