Thread: efficient storing of urls

efficient storing of urls

From: Shane Wegner

Hello list,

I have a database where one of the tables stores urls and
it's getting to the point where queries are getting quite
slow.  My urls table looks something like:

create table urls(
id serial,
url text,
unique(url),
primary key(id)
);

What I am thinking of doing is storing urls in a tree-like
structure

create table urls(
id serial,
url_part text,
parent_id int, -- references back to urls table
unique(parent_id,url_part)
);

So:
insert into urls (id,parent_id,url_part) values(1, NULL, 'http://www.mydomain.com');
insert into urls (id,parent_id,url_part) values(2, 1, '/images');

url id 2 would represent www.mydomain.com/images without
actually storing the full hostname and path for each url.

Is this a recommended way of storing urls or is there a
better way?  Is it likely to result in faster joins as each
row will be smaller?

One final question: how would one get the full url back out
of the sql table, referencing the parent back to the root
(null parent), for use by an sql LIKE query, and would that
procedure negate any performance benefits of this storage
method?
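
A rough sketch of that reconstruction, using the example rows above (WITH
RECURSIVE needs PostgreSQL 8.4 or later; on an older server the same walk
can be done in a plpgsql loop):

with recursive chain(id, url_part, parent_id) as (
    select id, url_part, parent_id from urls where id = 2  -- the leaf to rebuild
  union all
    select u.id, u.url_part || c.url_part, u.parent_id     -- prepend the parent's part
    from urls u join chain c on c.parent_id = u.id
)
select url_part as full_url from chain where parent_id is null;

For url id 2 this yields 'http://www.mydomain.com/images'.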

Thanks,
Shane

Re: efficient storing of urls

From: Sean Shanny

Shane,

Can you give an example of a query that has gotten slower due to the
increasing size of the urls table with an explain analyze?

Thanks.

--sean

Shane Wegner wrote:

>Hello list,
>
>I have a database where one of the tables stores urls and
>it's getting to the point where queries are getting quite
>slow.  My urls table looks something like:
>
>create table urls(
>id serial,
>url text,
>unique(url),
>primary key(id)
>);
>
>What I am thinking of doing is storing urls in a tree-like
>structure
>
>create table urls(
>id serial,
>url_part text,
>parent_id int, -- references back to urls table
>unique(parent_id,url_part)
>);
>
>So:
>insert into urls (id,parent_id,url_part) values(1, NULL, 'http://www.mydomain.com');
>insert into urls (id,parent_id,url_part) values(2, 1, '/images');
>
>url id 2 would represent www.mydomain.com/images without
>actually storing the full hostname and path for each url.
>
>Is this a recommended way of storing urls or is there a
>better way?  Is it likely to result in faster joins as each
>row will be smaller?
>
>One final question: how would one get the full url back out
>of the sql table, referencing the parent back to the root
>(null parent), for use by an sql LIKE query, and would that
>procedure negate any performance benefits of this storage
>method?
>
>Thanks,
>Shane
>

Re: efficient storing of urls

From: Bill Moran

Shane Wegner wrote:
> Hello list,
>
> I have a database where one of the tables stores urls and
> it's getting to the point where queries are getting quite
> slow.

What queries?  Do you have indexes on the queried fields?  Can
you please provide the EXPLAIN output from the slow queries?

If you've already looked at all these things, I apologize; if
not, you should look them over before you consider reorganizing
your database.

>  My urls table looks something like:
>
> create table urls(
> id serial,
> url text,
> unique(url),
> primary key(id)
> );
>
> What I am thinking of doing is storing urls in a tree-like
> structure
>
> create table urls(
> id serial,
> url_part text,
> parent_id int, -- references back to urls table
> unique(parent_id,url_part)
> );
>
> So:
> insert into urls (id,parent_id,url_part) values(1, NULL, 'http://www.mydomain.com');
> insert into urls (id,parent_id,url_part) values(2, 1, '/images');
>
> url id 2 would represent www.mydomain.com/images without
> actually storing the full hostname and path for each url.
>
> Is this a recommended way of storing urls or is there a
> better way?  Is it likely to result in faster joins as each
> row will be smaller?
>
> One final question: how would one get the full url back out
> of the sql table, referencing the parent back to the root
> (null parent), for use by an sql LIKE query, and would that
> procedure negate any performance benefits of this storage
> method?
>
> Thanks,
> Shane
>


--
Bill Moran
Potential Technologies
http://www.potentialtech.com


Re: efficient storing of urls

From: Shane Wegner

On Fri, Feb 27, 2004 at 06:00:36PM -0500, Sean Shanny wrote:
> Shane,
>
> Can you give an example of a query that has gotten slower due to the
> increasing size of the urls table with an explain analyze?

The database is a simple traffic monitoring tool so we have
a hits table which gets a new row for every url accessed.
Very simple table

create table hits(
hit_date date not null,
hit_time time(0) without time zone not null,
url_id int references urls(id)
);

A select to display the 100 most popular pages:
explain analyze select count(*) as c,url from hits,urls where hit_date between '2004-01-01' and '2004-01-31' and
url_id=urls.id group by url order by c desc limit 100;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
   ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
         Sort Key: count(*)
         ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
               ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
                     Sort Key: urls.url
                     ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
                           Merge Cond: ("outer".url_id = "inner".id)
                           ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
                                 Sort Key: hits.url_id
                                 ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
                                       Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
                           ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
                                 Sort Key: urls.id
                                 ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
 Total runtime: 34221.250 ms
(16 rows)

Time: 34224.959 ms

S

Re: efficient storing of urls

From: Bill Moran

Shane Wegner wrote:
> On Fri, Feb 27, 2004 at 06:00:36PM -0500, Sean Shanny wrote:
>
>>Shane,
>>
>>Can you give an example of a query that has gotten slower due to the
>>increasing size of the urls table with an explain analyze?
>
> The database is a simple traffic monitoring tool so we have
> a hits table which gets a new row for every url accessed.
> Very simple table
>
> create table hits(
> hit_date date not null,
> hit_time time(0) without time zone not null,
> url_id int references urls(id)
> );
>
> A select to display the 100 most popular pages:
> explain analyze select count(*) as c,url from hits,urls where hit_date between '2004-01-01' and '2004-01-31' and
> url_id=urls.id group by url order by c desc limit 100;
>                                                                       QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
>    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
>          Sort Key: count(*)
>          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
>                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
>                      Sort Key: urls.url
>                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
>                            Merge Cond: ("outer".url_id = "inner".id)
>                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
>                                  Sort Key: hits.url_id
>                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
>                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
>                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
>                                  Sort Key: urls.id
>                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
>  Total runtime: 34221.250 ms
> (16 rows)
>
> Time: 34224.959 ms

While there's a lot that can be tried to improve performance here, it doesn't
look like changing the url table is going to help much.

Notice that the sequential scan on the urls table takes a tiny amount of time
(compared to everything else).  You might do well to create an index on hit_date,
as the sequential scan on hits seems to take quite a while.

Also, it doesn't seem like this query is doing what you want at all.  It says
you only get 16 rows; are there only 16 URLs?

However, I would think the best thing you could do, if you're often counting
up hits, is to count them as they occur.  Add an INT column to urls (call it
"hits" or something) and each time you store a hit record in the hits table,
also do the following query:

UPDATE urls SET hits = hits + 1 WHERE url = '<the URL string>';

This will speed up and simplify the above query greatly.  You could even simplify
your client app by making the above UPDATE an insert trigger on the hits table.
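
A minimal sketch of such a trigger (names are illustrative, not from the
thread; it assumes the counter column is called "hits" and keys on url_id,
since that is what each hits row carries; dollar quoting needs 8.0 or later):

create or replace function bump_url_hits() returns trigger as $$
begin
    -- bump the per-url counter for the row just inserted into hits
    update urls set hits = hits + 1 where id = new.url_id;
    return new;
end;
$$ language plpgsql;

create trigger hits_count_trg
    after insert on hits
    for each row execute procedure bump_url_hits();

Note that a single running counter gives up the per-date-range breakdown the
original query had; per-period counters would be needed to keep that.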

--
Bill Moran
Potential Technologies
http://www.potentialtech.com


Re: efficient storing of urls

From: Chris Browne

shannyconsulting@earthlink.net (Sean Shanny) writes:
> Can you give an example of a query that has gotten slower due to the
> increasing size of the urls table with an explain analyze?

There's a "known issue" in that URL strings commonly contain the prefix:

   http://www.

What you get, as a result, is that there's very little uniqueness
there, and indices are known to suffer.

There was a report last week that essentially storing the URLs
backwards, and having a functional index on the backwards form, led to
greatly improved selectivity of the index.

The approach being suggested here looks more like the "prefix
splitting" typical of Patricia tries; that's what the New Oxford
English Dictionary project used for building efficient text search
indices.  It ought to be pretty quick, but pretty expensive in terms
of the complexity that gets added in.

I suspect that doing the "reverse the URL" trick would be a cheaper
"fix."
--
"cbbrowne","@","ntlug.org"
http://www.ntlug.org/~cbbrowne/linuxxian.html
"This .signature is  shareware.  Send in $20 for  the fully registered
version..."

Re: efficient storing of urls

From: "scott.marlowe"

On Fri, 27 Feb 2004, Shane Wegner wrote:

> On Fri, Feb 27, 2004 at 06:00:36PM -0500, Sean Shanny wrote:
> > Shane,
> >
> > Can you give an example of a query that has gotten slower due to the
> > increasing size of the urls table with an explain analyze?
>
> The database is a simple traffic monitoring tool so we have
> a hits table which gets a new row for every url accessed.
> Very simple table
>
> create table hits(
> hit_date date not null,
> hit_time time(0) without time zone not null,
> url_id int references urls(id)
> );
>
> A select to display the 100 most popular pages:
> explain analyze select count(*) as c,url from hits,urls where hit_date between '2004-01-01' and '2004-01-31' and
> url_id=urls.id group by url order by c desc limit 100;
>                                                                       QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
>    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
>          Sort Key: count(*)
>          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
>                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
>                      Sort Key: urls.url
>                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
>                            Merge Cond: ("outer".url_id = "inner".id)
>                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
>                                  Sort Key: hits.url_id
>                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
>                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
>                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
>                                  Sort Key: urls.id
>                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
>  Total runtime: 34221.250 ms
> (16 rows)

Your single biggest jump in actual time seems to be coming from the seq
scan on hits, which takes you from 40 ms to 12,000 ms.

Can you index the hit_date and see if that helps?


Re: efficient storing of urls

From: Shane Wegner

On Mon, Mar 01, 2004 at 08:54:43AM -0700, scott.marlowe wrote:
> On Fri, 27 Feb 2004, Shane Wegner wrote:
> > A select to display the 100 most popular pages:
> > explain analyze select count(*) as c,url from hits,urls where hit_date between '2004-01-01' and '2004-01-31' and
> > url_id=urls.id group by url order by c desc limit 100;
> >                                                                       QUERY PLAN
> > ------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Limit  (cost=320189.71..320189.96 rows=100 width=68) (actual time=34156.080..34156.324 rows=100 loops=1)
> >    ->  Sort  (cost=320189.71..320700.06 rows=204138 width=68) (actual time=34156.068..34156.208 rows=100 loops=1)
> >          Sort Key: count(*)
> >          ->  GroupAggregate  (cost=281214.19..283255.57 rows=204138 width=68) (actual time=32457.857..33584.861 rows=53460 loops=1)
> >                ->  Sort  (cost=281214.19..281724.54 rows=204138 width=68) (actual time=32457.690..32873.446 rows=248888 loops=1)
> >                      Sort Key: urls.url
> >                      ->  Merge Join  (cost=239594.05..244280.05 rows=204138 width=68) (actual time=21363.547..24385.213 rows=248888 loops=1)
> >                            Merge Cond: ("outer".url_id = "inner".id)
> >                            ->  Sort  (cost=168400.38..168914.15 rows=205508 width=4) (actual time=14785.934..15156.772 rows=249350 loops=1)
> >                                  Sort Key: hits.url_id
> >                                  ->  Seq Scan on hits  (cost=0.00..148512.07 rows=205508 width=4) (actual time=40.265..12081.506 rows=249350 loops=1)
> >                                        Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
> >                            ->  Sort  (cost=71193.67..72005.68 rows=324805 width=72) (actual time=6577.430..7422.945 rows=519307 loops=1)
> >                                  Sort Key: urls.id
> >                                  ->  Seq Scan on urls  (cost=0.00..7389.05 rows=324805 width=72) (actual time=0.110..1187.617 rows=324805 loops=1)
> >  Total runtime: 34221.250 ms
> > (16 rows)
>
> Your single biggest jump in actual time seems to be coming from the seq
> scan on hits, which takes you from 40 ms to 12,000 ms.
>
> Can you index the hit_date and see if that helps?

Hmm, I don't think it's using the index.  I tried two indexes
create index hit_date on hits(hit_date);
create index hit_date2 on hits(hit_date,url_id);
vacuum analyze;

With the hope that it would use the multicolumn index for
hit_date and url_id but it's still doing a seq scan on
hits.

explain analyze select count(*) as c,url from hits,urls where hit_date between
'2004-01-01' and '2004-01-31' and url_id=urls.id group by url order by c desc
limit 100;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=300458.08..300458.33 rows=100 width=68) (actual time=21117.590..21117.842 rows=100 loops=1)
   ->  Sort  (cost=300458.08..301021.65 rows=225429 width=68) (actual time=21117.579..21117.718 rows=100 loops=1)
         Sort Key: count(*)
         ->  GroupAggregate  (cost=256840.55..259094.84 rows=225429 width=68) (actual time=19645.926..20671.608 rows=53460 loops=1)
               ->  Sort  (cost=256840.55..257404.12 rows=225429 width=68) (actual time=19645.841..20000.521 rows=248888 loops=1)
                     Sort Key: urls.url
                     ->  Merge Join  (cost=163951.08..215477.30 rows=225429 width=68) (actual time=10623.308..13332.906 rows=248888 loops=1)
                           Merge Cond: ("outer".id = "inner".url_id)
                           ->  Index Scan using urls_pkey on urls  (cost=0.00..47329.44 rows=326181 width=72) (actual time=0.109..1063.677 rows=326181 loops=1)
                           ->  Sort  (cost=163951.08..164519.00 rows=227170 width=4) (actual time=10622.954..10966.062 rows=248889 loops=1)
                                 Sort Key: hits.url_id
                                 ->  Seq Scan on hits  (cost=0.00..141717.33 rows=227170 width=4) (actual time=0.282..8229.634 rows=249350 loops=1)
                                       Filter: ((hit_date >= '2004-01-01'::date) AND (hit_date <= '2004-01-31'::date))
 Total runtime: 21160.919 ms
(14 rows)

S

Re: efficient storing of urls

From: Martijn van Oosterhout

On Mon, Mar 01, 2004 at 11:23:45AM -0800, Shane Wegner wrote:
> Hmm, I don't think it's using the index.  I tried two indexes
> create index hit_date on hits(hit_date);
> create index hit_date2 on hits(hit_date,url_id);
> vacuum analyze;
>
> With the hope that it would use the multicolumn index for
> hit_date and url_id but it's still doing a seq scan on
> hits.

Try an index on (url_id,hit_date); there's a difference.
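
Spelled out (the index name is just illustrative):

create index hits_url_id_hit_date_idx on hits (url_id, hit_date);

Column order matters in a btree: only queries constrained on the leading
column(s) can use the index directly, so (url_id,hit_date) and
(hit_date,url_id) serve different access patterns.
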
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> If the Catholic church can survive the printing press, science fiction
> will certainly weather the advent of bookwarez.
>    http://craphound.com/ebooksneitherenorbooks.txt - Cory Doctorow
