Re: workaround for expensive KNN? - Mailing list pgsql-hackers

From PostgreSQL - Hans-Jürgen Schönig
Subject Re: workaround for expensive KNN?
Date
Msg-id 1C959340-EA09-4F06-8DC5-FD2E94BBDA46@cybertec.at
Whole thread Raw
In response to Re: workaround for expensive KNN?  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-hackers
price has a problem :(.
"iphone" can be a 20 cents bag or a sticker or a 900 euro thing signed by whoever ...
so, words and the sort-number / price are not related in anyway. price is in this case no way to narrow down the
problem(e.g. evaluate first or so). 
many thanks,
    hans


On Apr 8, 2011, at 5:25 PM, Oleg Bartunov wrote:

> Hans,
>
> what if you create index (price,title) ?
>
>
> On Fri, 8 Apr 2011, PostgreSQL - Hans-J?rgen Sch?nig wrote:
>
>> hello ...
>>
>> i got that one ...
>>
>>   "idx_product_t_product_titleprice" gist (to_tsvector('german'::regconfig, title), int_price)
>>
>> so, i have a combined index on text + number.
>> to me the plan seems fine ... it looks like a prober KNN traversal.
>> the difference between my plan and your plan seems to be the fact that i have, say, 1 mio rows which have "handy" or
soin it (1 mio out of 11 mio or so). you are moving out from one specific place. 
>>
>> my maths is like that:
>>     11 mio in total
>>     1 mio matching "iphone"
>>     cheapest / most expensive 10 out of this mio needed.
>>
>> operator classes are all nice and in place:
>>
>> SELECT 10 <-> 4 as distance;
>> distance
>> ----------
>>       6
>> (1 row)
>>
>> what does "buffers true" in your case say?
>>
>> many thanks,
>>
>>     hans
>>
>>
>> On Apr 8, 2011, at 3:22 PM, Oleg Bartunov wrote:
>>
>>> Probably, you miss two-columnt index. From my early post:
>>> http://www.sai.msu.su/~megera/wiki/knngist
>>>
>>> =# CREATE INDEX spots_idx ON spots USING knngist (coordinates, to_tsvector('french',address));
>>> =# SELECT id, address,  (coordinates <-> '(2.29470491409302,48.858263472125)'::point) AS dist FROM spots WHERE
coordinates>< '(2.29470491409302,48.858263472125)'::point AND to_tsvector('french',address) @@
to_tsquery('french','mars') LIMIT 10; 
>>>  id    |                           address                           |         dist
---------+-------------------------------------------------------------+---------------------
>>> 366096 | 1st Floor Tour Eiffel | Champs de Mars, Paris 75007, France | 2.32488941293945e-05
>>> 4356328 | r Champ de Mars 75007 PARIS                                 |  0.00421854756964406
>>> 5200167 | Champ De Mars 75007 Paris                                   |  0.00453564562587288
>>> 9301676 | Champ de Mars, 75007 Paris,                                 |  0.00453564562587288
>>> 2152213 | 16, ave Rapp, Champ de Mars, Tour Eiffel, Paris, France     |  0.00624152097590896
>>> 1923818 | Champ de Mars Paris, France                                 |  0.00838214733539654
>>> 5165953 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
>>> 7395870 | 39 Rue Champ De Mars Paris, France                          |  0.00874410234569529
>>> 4358671 | 32 Rue Champ De Mars Paris, France                          |  0.00876089659276339
>>> 1923742 | 12 rue du Champ de Mars Paris, France                       |  0.00876764731845995
>>> (10 rows)
>>>
>>> Time: 7.859 ms
>>>
>>> =# EXPLAIN (COSTS OFF) SELECT id, address FROM spots WHERE coordinates ><
'(2.29470491409302,48.858263472125)'::point
>>> AND to_tsvector('french',address) @@ to_tsquery('french','mars')  LIMIT 10;
>>>
>>>                           QUERY PLAN
>>>
--------------------------------------------------------------------------------------------------------------------------------------
>>> Limit
>>>  ->  Index Scan using spots_idx on spots
>>>        Index Cond: ((coordinates >< '(2.29470491409302,48.858263472125)'::point) AND
(to_tsvector('french'::regconfig,address) @@ '''mar'''::tsquery)) 
>>> (3 rows)
>>>
>>>
>>> On Fri, 8 Apr 2011, PostgreSQL - Hans-J?rgen Sch?nig wrote:
>>>
>>>> hello all ...
>>>>
>>>> given oleg's posting before i also wanted to fire up some KNN related question.
>>>> let us consider a simple example. i got some million lines and i want all rows matching a tsquery sorted by price.
>>>> i did some tests:
>>>>
>>>> test=# explain (analyze true, buffers true, costs true) SELECT id FROM product.t_product WHERE
to_tsvector('german',title) @@ to_tsquery('german', 'iphone') ORDER BY int_price <-> 0 LIMIT 10; 
>>>>                                                                          QUERY PLAN
>>>>
>>>>
-----------------------------------------------------------------------------------------------------------------------------
>>>> --------------------------------------
>>>> Limit  (cost=0.00..41.11 rows=10 width=16) (actual time=36391.717..45542.590 rows=10 loops=1)
>>>> Buffers: shared hit=9 read=5004
>>>> ->  Index Scan using idx_product_t_product_titleprice on t_product  (cost=0.00..13251.91 rows=3224 width=16)
(actualtime= 
>>>> 36391.715..45542.573 rows=10 loops=1)
>>>>       Index Cond: (to_tsvector('german'::regconfig, title) @@ '''iphon'''::tsquery)
>>>>       Order By: (int_price <-> 0::bigint)
>>>>       Buffers: shared hit=9 read=5004
>>>> Total runtime: 45542.676 ms
>>>> (7 rows)
>>>>
>>>> test=# explain (analyze true, buffers true, costs true) SELECT id FROM product.t_product WHERE
to_tsvector('german',title) @@ to_tsquery('german', 'handy') ORDER BY int_price <-> 0 LIMIT 10; 
>>>>                                                                          QUERY PLAN
>>>>
>>>>
-----------------------------------------------------------------------------------------------------------------------------
>>>> -------------------------------------
>>>> Limit  (cost=0.00..41.03 rows=10 width=16) (actual time=7243.526..10935.227 rows=10 loops=1)
>>>> Buffers: shared hit=3 read=2316
>>>> ->  Index Scan using idx_product_t_product_titleprice on t_product  (cost=0.00..29762.61 rows=7255 width=16)
(actualtime= 
>>>> 7243.524..10935.217 rows=10 loops=1)
>>>>       Index Cond: (to_tsvector('german'::regconfig, title) @@ '''handy'''::tsquery)
>>>>       Order By: (int_price <-> 0::bigint)
>>>>       Buffers: shared hit=3 read=2316
>>>> Total runtime: 10935.265 ms
>>>> (7 rows)
>>>>
>>>> test=# explain (analyze true, buffers true, costs true) SELECT id FROM product.t_product WHERE
to_tsvector('german',title) @@ to_tsquery('german', 'handy') ORDER BY int_price <-> 0 LIMIT 1; 
>>>>                                                                       QUERY PLAN
>>>>
>>>>
-----------------------------------------------------------------------------------------------------------------------------
>>>> -------------------------------
>>>> Limit  (cost=0.00..4.10 rows=1 width=16) (actual time=28.527..28.528 rows=1 loops=1)
>>>> Buffers: shared hit=1 read=1577
>>>> ->  Index Scan using idx_product_t_product_titleprice on t_product  (cost=0.00..29762.61 rows=7255 width=16)
(actualtime= 
>>>> 28.525..28.525 rows=1 loops=1)
>>>>       Index Cond: (to_tsvector('german'::regconfig, title) @@ '''handy'''::tsquery)
>>>>       Order By: (int_price <-> 0::bigint)
>>>>       Buffers: shared hit=1 read=1577
>>>> Total runtime: 28.558 ms
>>>> (7 rows)
>>>>
>>>>
>>>> under any circumstances - there is no way to reduce the number of buffers needed for a query like that.
>>>> if everything is cached this is still ok but as soon as you have to take a single block from disk you will die a
painfulrandom I/O death. 
>>>> is there any alternative which does not simply die when i try to achieve what i want?
>>>>
>>>> the use case is quite simple: all products with a certain word (10 cheapest or so).
>>>>
>>>> is there any alternative approach to this?
>>>> i was putting some hope into KNN but it seems it needs too much random I/O :(.
>>>>
>>>>     many thanks,
>>>>
>>>>         hans
>>>>
>>>> --
>>>> Cybertec Sch?nig & Sch?nig GmbH
>>>> Gr?hrm?hlgasse 26
>>>> A-2700 Wiener Neustadt, Austria
>>>> Web: http://www.postgresql-support.de
>>>>
>>>>
>>>>
>>>
>>>     Regards,
>>>         Oleg
>>> _____________________________________________________________
>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>> Sternberg Astronomical Institute, Moscow University, Russia
>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>>
>>> --
>>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>
>>
>> --
>> Cybertec Sch?nig & Sch?nig GmbH
>> Gr?hrm?hlgasse 26
>> A-2700 Wiener Neustadt, Austria
>> Web: http://www.postgresql-support.de
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



pgsql-hackers by date:

Previous
From: Nick Raj
Date:
Subject: Global variables in plpgsql
Next
From: Christopher Browne
Date:
Subject: Re: Global variables in plpgsql