Re: Self-referencing table question - Mailing list pgsql-sql

From Edmund Bacon
Subject Re: Self-referencing table question
Date
Msg-id 424302D1.7000602@onesystem.com
Whole thread Raw
In response to Re: Self-referencing table question  (Sean Davis <sdavis2@mail.nih.gov>)
Responses Re: Self-referencing table question  (Sean Davis <sdavis2@mail.nih.gov>)
List pgsql-sql
Sean Davis wrote:

> Thanks.  I thought about that a bit and it seems like it is highly 
> likely to be expensive for a single query (though I should probably 
> try it at some point).  If I do find myself reformatting results after 
> response to user input (i.e., reusing the query), though, then your 
> solution is likely to be very useful.
>
You might think so, however, consider:

$ cat temptable.sql SELECT to_id   INTO TEMP TABLE tids   FROM correlation  WHERE from_id  = 1234  ORDER BY val DESC
limit100;
 
 SELECT t1.to_id AS from_id, t2.to_id   INTO TEMP TABLE from_to   FROM tids t1, tids t2  WHERE t1.to_id > t2.to_id;
 SELECT c.from_id, c.to_id, c.val   FROM from_to   JOIN correlation c USING(from_id, to_id)  WHERE val > 0.5  order by
from_id,to_id;
 
$ cat subselect.sql
select   from_id, to_id, val   from correlation   where from_id in (select to_id from correlation
    where from_id = 1234                order by val desc limit 100)       and to_id in (select to_id from correlation
                         where from_id = 1234                order by val desc limit 100)   and from_id > to_id   and
val> 0.5   order by from_id, to_id;
 
$ psql -q -c "select count(1) from correlation" test count  
----------64000000
(1 row)

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m3.093s
user    0m0.000s
sys    0m0.010s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.238s
user    0m0.000s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.945s
user    0m0.010s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.949s
user    0m0.010s
sys    0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m1.953s
user    0m0.010s
sys    0m0.000s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957  -

real    0m0.237s
user    0m0.020s
sys    0m0.000s
$

Note that the subselect version takes about 10 times as long as the 
temptable version, and does not seem to be dependent on what data might 
be cached.

I added the sort so that the results would be in the same order - you 
can see that the two query sets  are producing the same output (at least 
md5sum thinks they are the same).

One thing to be aware of is the size of your returned data set - If it's 
fairly large, then the transfer time from your web-server to the pgsql 
box might overwhelm any "small" optimization in query time. 

> Sean
>
> On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
>
>> Sometimes using a temp table is a better idea:
>>
>> e.g.
>>
>> -- start by creating a temp table 'tids' that hold the to_ids that
>> -- we are interested in.
>>  SELECT to_id
>>    INTO TEMP TABLE tids
>>    FROM correlation
>>   WHERE from_id  = 1234
>>   ORDER BY val DESC limit 100;
>>
>> -- The following temp table makes use of the primary key on
>> -- the correlation table, and the stated goal from the original
>> -- question that:
>> --         from_id > to_id
>> --     and from_id in (tids.to_id)
>> --    and to_id in (tids.to_id)
>>
>>  SELECT t1.to_id AS from_id, t2.to_id
>>    INTO TEMP TABLE from_to
>>    FROM tids t1, tids t2
>>   WHERE t1.to_id > t2.to_id;
>>
>> -- Now we can use the from_to table as an index into the correlation
>> -- table.
>>
>>  SELECT c.from_id, c.to_id, c.val
>>    FROM from_to
>>    JOIN correlation c USING(from_id, to_id)
>>   WHERE val > 0.5;
>>
>>
>> The explain analyze for the final select works out to:
>> Nested Loop  (cost=0.00..50692.00 rows=8488 width=16) (actual 
>> time=0.171..150.095 rows=2427 loops=1)
>>   ->  Seq Scan on from_to  (cost=0.00..79.38 rows=5238 width=8) 
>> (actual time=0.006..7.660 rows=4950 loops=1)
>>   ->  Index Scan using correlation_pkey on correlation c  
>> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0 
>> loops=4950)
>>         Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id 
>> = c.to_id))
>>         Filter: (val > 0.5::double precision)
>> Total runtime: 152.261 ms
>>
>>
>> Richard Huxton wrote:
>>
>>> Sean Davis wrote:
>>>
>>>> I answer my own question, if only for my own records.  The 
>>>> following  query is about 5-6 times faster than the original.  Of 
>>>> course, if  anyone else has other ideas, I'd be happy to hear them.
>>>>
>>>> Sean
>>>>
>>>> explain analyze select from_id,to_id,val from exprsdb.correlation 
>>>> where  from_id in (select to_id from exprsdb.correlation where 
>>>> from_id=2424  order by val desc limit 100) and to_id in (select 
>>>> to_id from  exprsdb.correlation where from_id=2424 order by val 
>>>> desc limit 100) and  val>0.6 and to_id<from_id;
>>>
>>>
>>>
>>> Might not be any faster, but you can do this as a self-join with 
>>> subquery:
>>>
>>> SELECT c1.from_id, c1.to_id, c1.val
>>> FROM
>>>   correlation c1,
>>>   (
>>>     SELECT to_id FROM correlation WHERE from_id=2424
>>>     ORDER BY val DESC LIMIT 100
>>>   ) AS c2
>>>   (
>>>     SELECT to_id FROM correlation WHERE from_id=2424
>>>     ORDER BY val DESC LIMIT 100
>>>   ) AS c3
>>> WHERE
>>>   c1.from_id = c2.to_id
>>>   AND c1.to_id = c3.to_id
>>>   AND c1.val > 0.5
>>>   AND c1.to_id < from_id
>>> ;
>>>
>>> I think PG should be smart enough nowadays to figure out these two 
>>> queries are basically the same.
>>
>>
>>
>> -- 
>> Edmund Bacon <ebacon@onesystem.com>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 3: if posting/reading through Usenet, please send an appropriate
>>      subscribe-nomail command to majordomo@postgresql.org so that your
>>      message can get through to the mailing list cleanly
>
>

-- 
Edmund Bacon <ebacon@onesystem.com>



pgsql-sql by date:

Previous
From: "Jim Buttafuoco"
Date:
Subject: Re: Funtions + plpgsql + contrib/pgcrypto = ??
Next
From: "Moran.Michael"
Date:
Subject: Re: Funtions + plpgsql + contrib/pgcrypto = ??