Thread: combined indexes with Gist - planner issues?

combined indexes with Gist - planner issues?

From
Hans-Juergen Schoenig -- PostgreSQL
Date:
hello everybody,

we are seriously fighting with some planner issue which seems to be 
slightly obscure to us.
we have a table which is nicely indexed (several GB in size).
i am using btree_gist operator classes to build a combined index that
includes an FTI expression along with a numeric column:

db=# \d product.t_product
                                  Table "product.t_product"
        Column         |     Type      |                           Modifiers
-----------------------+---------------+----------------------------------------------------------------
 id                    | bigint        | not null default nextval('product.t_product_id_seq'::regclass)
 shop_id               | integer       |
 art_number            | text          |
 title                 | text          |
 description           | text          |
 display_price         | numeric(10,4) |
Indexes:
    "t_product_pkey" PRIMARY KEY, btree (id)
    "idx_test" gist (display_price, to_tsvector('german'::regconfig, (title || ' '::text) || description))
    "idx_test2" gist (to_tsvector('german'::regconfig, (title || ' '::text) || description), display_price)


what we basically expected here is that Postgres will scan the table 
using the index to give us the cheapest products containing the words we 
are looking for.
i am totally surprised to see that we have to fetch all products 
matching the words, sort them, and only then apply the limit.
this totally kills performance because some words simply show up 
millions of times.

the plans look like this:

db=# explain analyze SELECT art_number, title
       FROM product.t_product
       WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description) @@ plainto_tsquery('harddisk')
       ORDER BY display_price
       LIMIT 10;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=108340.08..108340.10 rows=10 width=54) (actual time=1328.900..1328.909 rows=10 loops=1)
   ->  Sort  (cost=108340.08..108422.48 rows=32961 width=54) (actual time=1328.899..1328.905 rows=10 loops=1)
         Sort Key: display_price
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Bitmap Heap Scan on t_product  (cost=2716.62..107627.80 rows=32961 width=54) (actual time=1052.706..1328.772 rows=55 loops=1)
               Recheck Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('harddisk'::text))
               ->  Bitmap Index Scan on idx_test2  (cost=0.00..2708.38 rows=32961 width=0) (actual time=1052.576..1052.576 rows=55 loops=1)
                     Index Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('harddisk'::text))
 Total runtime: 1328.942 ms
(9 rows)


runtime degrades badly as the words become more frequent ...


db=# explain analyze SELECT art_number, title
       FROM product.t_product
       WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description) @@ plainto_tsquery('spiel')
       ORDER BY display_price
       LIMIT 10;
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=108340.08..108340.10 rows=10 width=54) (actual time=33489.675..33489.682 rows=10 loops=1)
   ->  Sort  (cost=108340.08..108422.48 rows=32961 width=54) (actual time=33489.675..33489.675 rows=10 loops=1)
         Sort Key: display_price
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Bitmap Heap Scan on t_product  (cost=2716.62..107627.80 rows=32961 width=54) (actual time=774.923..33408.522 rows=56047 loops=1)
               Recheck Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('spiel'::text))
               ->  Bitmap Index Scan on idx_test2  (cost=0.00..2708.38 rows=32961 width=0) (actual time=759.078..759.078 rows=56047 loops=1)
                     Index Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('spiel'::text))
 Total runtime: 33489.906 ms
(9 rows)

i am wondering why postgres is not able to use a combined index here?
is this some obscure thing related to gist, a logical problem or a 
planner deficiency?

ideas are welcome.
   many thanks,
      hans



-- 
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: combined indexes with Gist - planner issues?

From
Tom Lane
Date:
Hans-Juergen Schoenig -- PostgreSQL <postgres@cybertec.at> writes:
> what we basically expected here is that Postgres will scan the table 
> using the index to give us the cheapest products containing the words we 
> are looking for.
> i am totally surprised to see that we have to fetch all products given 
> the words, sort and then do the limit.

I don't know why you'd find that surprising.  GIST indexes have no
support for ordering.
        regards, tom lane


Re: combined indexes with Gist - planner issues?

From
Hans-Juergen Schoenig -- PostgreSQL
Date:
Tom Lane wrote:
> Hans-Juergen Schoenig -- PostgreSQL <postgres@cybertec.at> writes:
>   
>> what we basically expected here is that Postgres will scan the table 
>> using the index to give us the cheapest products containing the words we 
>> are looking for.
>> i am totally surprised to see that we have to fetch all products given 
>> the words, sort and then do the limit.
>>     
>
> I don't know why you'd find that surprising.  GIST indexes have no
> support for ordering.
>
>             regards, tom lane
>
>   

ok, i thought it would be something gist specific i was not aware of.
the golden question now is: i am looking for the cheapest products given 
a certain text in an insane amount of data.
how to do it? other quals which could narrow down the amount of data 
would not help.

i cannot see an option with regular "weapons" ...
maybe you have an idea how to fix core to make it work? maybe there is a 
mechanism we would need.
we really have to make this work - no matter what it takes.
we are willing to put effort into that.
   many thanks,
      hans

-- 
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: combined indexes with Gist - planner issues?

From
Martijn van Oosterhout
Date:
On Mon, Aug 31, 2009 at 04:06:22PM +0200, Hans-Juergen Schoenig -- PostgreSQL wrote:
> ok, i thought it would be something gist specific i was not aware of.
> the golden question now is: i am looking for the cheapest products given
> a certain text in an insane amount of data.
> how to do it? other quals which could narrow down the amount of data
> would not help.
>
> i cannot see an option with regular "weapons" ...
> maybe you can an idea how to fix core to make it work? maybe there is a
> mechanism we could need.
> we really have to make this work - no matter what it takes.
> we are willing to put effort into that.

The way I usually attack such a problem is to think of a data
structure+algorithm that could produce the output you want. Once you've
got that it's usually clear how you can make postgres do it and what
changes would need to be made.

At first glance I don't see any nice data structure specific for your
problem. But it occurs to me that maybe you could just have a (btree)
index on the price and just scan in ascending order until you have
enough records. Expensive if the first matching record only turns up
far into the scan.

Another possibility is to change your query to use the price in the
GiST index: execute multiple queries of the form:

... AND display_price >= 0.01 and display_price < 1;
... AND display_price >= 1 and display_price < 10;

Because you match fewer records the sort won't be so expensive and you
can stop once you have enough records.
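A minimal sketch of that band-by-band idea (the band boundaries and the
word 'harddisk' are only illustrative, and the accumulation loop would
live in the application):

  -- first band: if this returns 10 rows, we are done
  SELECT art_number, title
    FROM product.t_product
   WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description)
         @@ plainto_tsquery('harddisk')
     AND display_price >= 0.01 AND display_price < 1
   ORDER BY display_price
   LIMIT 10;
  -- if fewer than 10 rows came back, repeat with the next band
  -- (display_price >= 1 AND display_price < 10, and so on),
  -- accumulating the rows client-side until the limit is reached.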

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: combined indexes with Gist - planner issues?

From
Hans-Juergen Schoenig -- PostgreSQL
Date:
Martijn van Oosterhout wrote:
> On Mon, Aug 31, 2009 at 04:06:22PM +0200, Hans-Juergen Schoenig -- PostgreSQL wrote:
>   
>> ok, i thought it would be something gist specific i was not aware of.
>> the golden question now is: i am looking for the cheapest products given  
>> a certain text in an insane amount of data.
>> how to do it? other quals which could narrow down the amount of data  
>> would not help.
>>
>> i cannot see an option with regular "weapons" ...
>> maybe you can an idea how to fix core to make it work? maybe there is a  
>> mechanism we could need.
>> we really have to make this work - no matter what it takes.
>> we are willing to put effort into that.
>>     
>
> The way I usually attack such a problem is to think of a data
> structure+algorithm that could produce the output you want. Once you've
> got that it's usually clear how you can make postgres do it and what
> changes would need to be made.
>
> At first glance I don't see any nice data structure specific for your
> problem. But it occurs to me that maybe you could just have a (btree)
> index on the price and just scan in asceding order until you have
> enough records. Expensive if the first record is expensive.
>
> Another possibility is to change your query to use the price in the
> GiST index: execute multiple queries of the form:
>
> ... AND display_price >= 0.01 and display_price < 1;
> ... AND display_price >= 1 and display_price < 10;
>
>   

hello ...

i had a similar idea here but the problem is: prices pretty much 
depend on the product.
to take a critical example: "book" is a horribly frequent word and 
you will find just too many books in a too narrow price range.
using a price index alone is not a good idea. how many products which 
cost USD 9.95 do you know and how many of them are books? :(
i did some experiments with PL/proxy to scale out a little and i wrote 
some C code to explicitly cache data from the start and so on.
this is all shit, however - it is too much data and i have too many requests.
i don't want to fall back to some java-based stuff such as solr. it would 
totally ruin my credibility and the standing postgres has at this customer.
whatever it takes - a PG based solution has to be found and implemented.

my knowledge of how gist works internally is not too extensive. any 
"kickstart" idea would be appreciated.
   many thanks,
      hans

-- 
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: combined indexes with Gist - planner issues?

From
Heikki Linnakangas
Date:
Hans-Juergen Schoenig -- PostgreSQL wrote:
> my knowledge of how gist works internally is not too extensive. any
> "kickstart" idea would be appreciated.

If there's not too many of those common words, you can create a simple
partial b-tree index for each, and handle the less common words with the
gist index you have (you can drop the display_price column from the index).
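For illustration, a partial index for one hypothetical common word
('buch') could look like this; note that to_tsquery with an explicit
configuration is immutable, which a partial-index predicate requires:

  CREATE INDEX idx_price_buch ON product.t_product (display_price)
   WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description)
         @@ to_tsquery('german', 'buch');

A query whose WHERE clause repeats that predicate can then walk the
index in price order and stop as soon as the LIMIT is satisfied.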

Another idea:

Create a table containing one row for each word in each product:

CREATE TABLE t_product_word (id bigint, word text, display_price
numeric(10,4));

with triggers to keep it up-to-date. You can then create a regular two
column b-tree index on that:

CREATE INDEX idx_word_price ON t_product_word (word, display_price);

And query with:

SELECT p.art_number, p.title
   FROM t_product p INNER JOIN t_product_word pw ON p.id = pw.id
   WHERE pw.word = 'harddisk'
ORDER BY pw.display_price DESC LIMIT 10;

The t_product_word table will be huge, but with a few gigabytes of data
it should still be manageable.
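A rough sketch of such a maintenance trigger (the function and trigger
names are hypothetical, and splitting on whitespace is a simplification
-- real code would extract the same lexemes the tsvector expression
produces):

  CREATE FUNCTION product_word_sync() RETURNS trigger AS $$
  BEGIN
      -- throw away the old word rows for this product, then re-derive them
      DELETE FROM t_product_word WHERE id = NEW.id;
      INSERT INTO t_product_word (id, word, display_price)
      SELECT DISTINCT NEW.id, w, NEW.display_price
        FROM regexp_split_to_table(lower(NEW.title || ' ' || NEW.description),
                                   E'\\s+') AS w;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER trg_product_word
  AFTER INSERT OR UPDATE ON product.t_product
  FOR EACH ROW EXECUTE PROCEDURE product_word_sync();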

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: combined indexes with Gist - planner issues?

From
Hans-Juergen Schoenig -- PostgreSQL
Date:
hello ...

we did some experiments with such a table.
the problem is that we want to allow arbitrary combinations of words, 
which FTI models perfectly.
with a word table you would instantly end up with a self join across 5 
relations or so - which is again bad.

there are too many common words to consider handling them partly with 
gist and partly with a btree.

is there any option to adapt gist in a way that a combined index would 
make sense here?
   many thanks,
      hans




Heikki Linnakangas wrote:
> Hans-Juergen Schoenig -- PostgreSQL wrote:
>   
>> my knowledge of how gist works internally is not too extensive. any
>> "kickstart" idea would be appreciated.
>>     
>
> If there's not too many of those common words, you can create a simple
> partial b-tree index for each, and handle the less common words with the
> gist index you have (you can drop the display_price column from the index).
>
> Another idea:
>
> Create a table containing one row for each word in each product:
>
> CREATE TABLE t_product_word (id bigint, word text, display_price
> numeric(10,4));
>
> with triggers to keep it up-to-date. You can then create a regular two
> column b-tree index on that:
>
> CREATE INDEX idx_word_price ON t_product_word (word, display_price);
>
> And query with:
>
> SELECT p.art_number, p.title
>    FROM t_product p INNER JOIN t_product_word pw ON p.id = pw.id
>    WHERE pw.word = 'harddisk'
> ORDER BY pw.display_price DESC LIMIT 10;
>
> The t_product_word table will be huge, but with a few gigabytes of data
> it should still be manageable.
>
>   


-- 
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de



Re: combined indexes with Gist - planner issues?

From
Markus Wanner
Date:
Hi,

Hans-Juergen Schoenig -- PostgreSQL wrote:
> we did some experiments with doing such a table.
> the problem is if you want to allow arbitrary combinations of words
> which can be modeled perfectly with FTI.
> you would instantly end up with a self join with 5 relations or so -
> which is again bad.
> 
> there are too many common words to consider doing with partly with gist
> and partly with a btree.

How about an inverted index, either via GIN or with a custom table, such
that you have the cheapest price per existing word (that's pretty close
to how full text searching itself works). Either reduce the number of
words with tsearch2's stemming algorithms, or go for trigrams right
away: split each word or search query into all its trigrams, then look
up the (cheapest) price(s) per trigram and return the n least expensive
ones.

I've done something pretty similar for a customer, using the custom
table approach, as GIN integration had only just started back then. Even now,
you need to consider the downside of that index lacking visibility
information and having to recreate the index from time to time. OTOH a
custom table needs a lot more manual twiddling with triggers and bulk
"index" rebuilding.

I guess I'd still go for a custom table today, as it's simply more
flexible. Something like:

CREATE TABLE cheapest_product_by_word (
    trgm TEXT,
    cheapest_products bigint[]
);

However, all of this is assuming that data (i.e. prices, products)
change very rarely and it's beneficial to calculate such an intermediate
lookup-table in advance. Not sure how much that's the case for you.
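One possible way to populate such a lookup table, as a sketch only:
show_trgm() comes from the pg_trgm module, the LIMIT of 10 cheapest ids
per trigram is arbitrary, and the LIKE-based inner query is illustrative
rather than fast:

  INSERT INTO cheapest_product_by_word (trgm, cheapest_products)
  SELECT t.trgm,
         array(SELECT p.id
                 FROM product.t_product p
                WHERE p.title || ' ' || p.description LIKE '%' || t.trgm || '%'
                ORDER BY p.display_price
                LIMIT 10)
    FROM (SELECT DISTINCT unnest(show_trgm(title || ' ' || description)) AS trgm
            FROM product.t_product) t;

At query time one would intersect the arrays for all trigrams of the
search term and fetch those products by id.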

Regards

Markus Wanner