Thread: Re: Full text index not being used

Re: Full text index not being used

From: Alex
Date: Sun, 1 Feb 2009
So this seems to be because the result size is too big.  I still don't
know why it is looping through every record and printing a warning,
but adding a LIMIT makes the queries complete in a reasonable time
(although not all that fast).

However, I need to sort, and I also have many other facets that may or
may not be included in the query.  Adding a sort makes it load every
record again and take forever.

I tried to create an index including all of the fields I query on to
see if that would work, but I get an error that the index row is too
large:

=> create index master_index on source_listings(geo_lat, geo_lon,
price, bedrooms, region, city, listing_type, to_tsvector('english',
full_listing), post_time);
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
ERROR:  index row requires 13356 bytes, maximum size is 8191

Any ideas about how to resolve this?

Re: Full text index not being used

From: Oleg Bartunov
Date: Mon, 2 Feb 2009
Alex,

What text are you indexing? I don't believe you have meaningful
words that long (> 2047 characters).

Do you really need a multicolumn index?

I'd recommend separating the problem: create a column fts holding
to_tsvector('english', full_listing), create an index on it, and try a
full-text query. The way you are doing it now implies calling
to_tsvector every time you search, which can be very costly.
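Concretely, the separated setup described above would look something like this (just a sketch - the column and index names are placeholders, and on 8.3 the column can be kept up to date with the built-in tsvector_update_trigger described in the textsearch-tables docs):

-- dedicated tsvector column, filled once and indexed with GIN
ALTER TABLE source_listings ADD COLUMN fts tsvector;
UPDATE source_listings SET fts = to_tsvector('english', full_listing);
CREATE INDEX source_listings_fts_idx ON source_listings USING gin (fts);
ANALYZE source_listings;

-- searches then hit the precomputed column instead of calling
-- to_tsvector on every row at query time
SELECT id
  FROM source_listings
 WHERE fts @@ plainto_tsquery('english', 'two bedroom apartment');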

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Full text index not being used

From: Teodor Sigaev
Date:

> I tried to create an index including all of the fields I query on to
> see if that would work, but I get an error that the index row is too
> large:
>
> => create index master_index on source_listings(geo_lat, geo_lon,
> price, bedrooms, region, city, listing_type, to_tsvector('english',
> full_listing), post_time);
That is not a full-text index - btree doesn't support the @@ operator. Read
carefully: http://www.postgresql.org/docs/8.3/static/textsearch.html,
and about full-text indexes:
http://www.postgresql.org/docs/8.3/static/textsearch-tables.html,
http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html
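For reference, the expression-index form those pages describe would be roughly the following (a sketch with a made-up index name; the WHERE clause must repeat exactly the same to_tsvector(...) expression for the planner to consider the index):

-- GIN full-text index directly on the expression, no extra column needed
CREATE INDEX source_listings_fulltext_idx
    ON source_listings
 USING gin (to_tsvector('english', full_listing));

-- @@ queries have to use the identical expression
SELECT id
  FROM source_listings
 WHERE to_tsvector('english', full_listing) @@ to_tsquery('english', 'sunny & apartment');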

Re: Full text index not being used

From: Alex Neth
Date: Tue, 3 Feb 2009
Based on suggestions from this list, I am trying to create a tsvector
column and index that, since it is perhaps the recheck and rebuilding
of all the vectors that is slowing things down.  I don't understand
why a recheck is necessary on a GIN index...

My update statement has been running for 36 hours now and has not
finished.  The statement is:  update source_listings set flv =
to_tsvector('english', full_listing);  I know that it is still working
because it occasionally prints out one of those long-word errors.

I have only 1.6M rows, and each entry in that column is a standard-size
web page with just the text, maybe 3-5K.

For sure I don't have meaningful long words.  Perhaps that is because
it is not handling the HTML well and I should be stripping the web
page down first.  Hopefully that doesn't mean I need to rebuild this
column over the course of 3 days - I didn't expect it to take this
long, so I thought I'd just try it out.
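If the over-long "words" really are HTML markup, one crude way to strip tags before vectorizing is a regexp_replace - only a sketch, and it assumes reasonably plain markup (the same caveat about long-running UPDATEs applies):

-- rough tag stripping before building the tsvector
UPDATE source_listings
   SET flv = to_tsvector('english',
                         regexp_replace(full_listing, '<[^>]*>', ' ', 'g'));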





Re: Full text index not being used

From: Oleg Bartunov
Date: Tue, 3 Feb 2009
Alex, can you somehow identify the document that triggers the
long-word errors?
Also, if you have space on disk, I'd recommend trying

select *, to_tsvector('english', full_listing) as flv from source_listings;
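One way to spot such rows - a sketch only: it splits on whitespace, which merely approximates the parser's idea of a word, and it scans the whole table, so it is slow but only needs to run once:

-- longest whitespace-delimited token per row; > 2047 is what the NOTICEs flag
SELECT id, max(length(w)) AS longest_token
  FROM (SELECT id,
               regexp_split_to_table(full_listing, '[[:space:]]+') AS w
          FROM source_listings) AS s
 GROUP BY id
HAVING max(length(w)) > 2047;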

I don't remember whether you gave us information about
your setup (pg version, OS, memory, what you changed in postgresql.conf...)


Oleg

Re: Full text index not being used

From: Alex Neth
Date: Wed, 4 Feb 2009
I've also found that other queries which were really fast in MySQL are
really slow in Postgres.  I'm hoping that is a matter of tuning.
Overall I'm finding the query times to be extremely unpredictable.

I added a slow-query logger to my application that also does an
EXPLAIN.  Check these out.  The time in parentheses is the time for the
initial execution (before it was cached).  These were not taken under
heavy load.  Note that there are only around 400 users - not a lot.  22
seconds is ridiculous.

Slow query: (22.0198369026184) [0] SELECT * FROM "users" WHERE
("users"."remember_token" =
E'26a2b54d2b2e18edc8aa2e1f0e0508b0d858d650')  LIMIT 1
Limit  (cost=0.00..33.17 rows=1 width=784) (actual time=0.182..0.183 rows=1 loops=1)
  ->  Seq Scan on users  (cost=0.00..33.17 rows=1 width=784) (actual time=0.181..0.181 rows=1 loops=1)
        Filter: ((remember_token)::text = '26a2b54d2b2e18edc8aa2e1f0e0508b0d858d650'::text)
Total runtime: 0.223 ms
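As an aside on the plan above: the lookup on users is a sequential scan on remember_token, so if that query runs on every request, a plain btree index (sketched below with a made-up name) would normally turn it into an index probe. Whether it explains the 22-second cold time is a separate question, since the cached runtime is already 0.223 ms.

CREATE INDEX users_remember_token_idx ON users (remember_token);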

Slow query: (59.8780090808868) [0] SELECT id FROM "source_listings"
WHERE (post_time BETWEEN '2009-01-27 08:14:58.261978' AND '2009-02-05
08:14:58.262034' AND ((geo_lon BETWEEN 5751555 AND 5759369 AND geo_lat
BETWEEN 12773379 AND 12776908)))  ORDER BY post_time DESC LIMIT 60
OFFSET 0
Limit  (cost=89.38..89.38 rows=1 width=12) (actual time=1368.555..1368.644 rows=60 loops=1)
  ->  Sort  (cost=89.38..89.38 rows=1 width=12) (actual time=1368.552..1368.588 rows=60 loops=1)
        Sort Key: post_time
        Sort Method:  top-N heapsort  Memory: 19kB
        ->  Index Scan using x_sl_lat_lon_pt_br_lt_region on source_listings  (cost=0.00..89.37 rows=1 width=12) (actual time=0.097..1365.469 rows=2078 loops=1)
              Index Cond: ((geo_lat >= 12773379) AND (geo_lat <= 12776908) AND (geo_lon >= 5751555) AND (geo_lon <= 5759369) AND (post_time >= '2009-01-27 08:14:58.261978'::timestamp without time zone) AND (post_time <= '2009-02-05 08:14:58.262034'::timestamp without time zone))
Total runtime: 1368.722 ms



On Feb 3, 2009, at 8:15 PM, Oleg Bartunov wrote:

> Alex, can you somehow identify the document that triggers the
> long-word errors? Also, if you have space on disk, I'd recommend
> trying
>
> select *, to_tsvector('english', full_listing) as flv from
> source_listings;

This is equally slow.

>
>
> I don't remember whether you gave us information about
> your setup (pg version, OS, memory, what you changed in
> postgresql.conf...)
>

Version is 8.3.5.  Ubuntu with kernel 2.6.21.  2 GB RAM.  postgresql.conf changes:

shared_buffers = 24MB            # min 128kB or max_connections*16kB

work_mem = 10MB                # min 64kB

max_fsm_pages = 153600            # min max_fsm_relations*16, 6 bytes each




Re: Full text index not being used

From: Oleg Bartunov
Date:
Alex,

it looks like you need to read the documentation and tune postgresql.conf;
your current settings do not look good. I have no time to guide you, so
search the archives for postgresql.conf tuning - it has been discussed many times.
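As a very rough illustration of the kind of changes those archive threads usually suggest for a dedicated box with 2 GB of RAM - starting points only, the right numbers depend on the workload:

# postgresql.conf - common starting points for ~2 GB RAM (8.3)
shared_buffers = 512MB          # ~25% of RAM, instead of the 24MB currently set
effective_cache_size = 1GB      # rough estimate of what the OS will cache
work_mem = 16MB                 # per sort/hash node, so watch concurrent queries
maintenance_work_mem = 256MB    # speeds up CREATE INDEX and VACUUM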

Oleg