
From: Gordan Bobic
Subject: Re: FTI Queries and Explain (long)
Date:
Msg-id: 200110181602.f9IG2uT01099@sentinel.bobich.net
In response to: Re: FTI Queries and Explain (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: FTI Queries and Explain (long) (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
[Broken SQL rather than a performance issue - fixed]

It would appear that when I define the index on the FTI table (string and
oid) to be unique (which makes sense, since there is little point in having
duplicate rows in this case), a lot of inserts fail where they shouldn't. I
am guessing that if the insert into the look-up FTI table fails, the insert
into the master table fails as well.
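
For reference, the index in question is created along these lines (the index
name here is just for illustration):

CREATE UNIQUE INDEX jobs_description_fti_string_id
    ON Jobs_Description_FTI (string, id);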

I can understand that this might be useful for matches where the number of
occurrences is important, but in this particular application, that is not the
case. Before I go and look into modifying the fti function code for my
specific purpose, it would be nice to have a confirmation of this behaviour -
otherwise it may take me a while to find what I'm looking for. ;-)

Another question: there are (as often happens) multiple ways of doing what I
want in SQL, but I am not sure which is the fastest and most efficient way of
doing it (in theory, at least).

I want to do a multi-criterion search on the same field (the FTI indexed
one), and have separate AND, NOT and OR search terms.
AND = "terms that must occur in the text"
OR = "terms of which at least one has to occur in the text"
NOT = "terms which must not occur in the text"

Initially, before FTI, I used a big ILIKE query which worked reasonably well.

I should point out that my test bed machine for this is a Pentium 100 MHz
with 128 MB of RAM and an IDE disk. My database is expected to be around
50K-100K records, and about 100-200 MB on disk in PostgreSQL files (that's
what the disk consumption of the vacuumed database was before FTI).

Using the same example data set as before, the query was something like:

SELECT    *
FROM    Jobs
WHERE    (
        Description ILIKE '%AndTerm1%'    AND
        Description ILIKE '%AndTerm2%'    AND
        Description ILIKE '%AndTerm3%'
        ...
    )
    AND
    (
        Description ILIKE '%OrTerm1%'    OR
        Description ILIKE '%OrTerm2%'    OR
        Description ILIKE '%OrTerm3%'
        ...
    )
    AND
    (
        Description NOT ILIKE '%NotTerm1%'    AND
        Description NOT ILIKE '%NotTerm2%'    AND
        Description NOT ILIKE '%NotTerm3%'
        ...
    )

This usually returned the required data within 30 seconds or so, after,
obviously, doing a sequential search through the database due to the
non-anchored ILIKE match.

After implementing FTI, the insertion speed has gone through the floor (as
expected), but the select speed doesn't seem to be that much greater, even
when using the index (string, oid) on the FTI look-up table. On simple
queries that only require one or two terms there is a big speed improvement,
but on queries with three or more terms, the improvement is not that great.

The queries typically return 10 - 200 rows (give or take a bit, depending on
the specific query).

The queries I am using at the moment to replace the above ILIKE solution are
in the form

SELECT     Jobs.*
FROM    Jobs,
    Jobs_Description_FTI
WHERE    Jobs_Description_FTI.string    = $And/Or/NotTerms[$i]    AND
    Jobs_Description_FTI.id    = Jobs.oid

The AND queries are INTERSECTed together, the OR queries are UNIONed together,
both are UNIONed, and then the NOT queries are EXCEPTed.
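
Written out with a couple of placeholder terms, the combined statement ends up
shaped roughly like this (the terms are just examples; the real query is built
up programmatically from the search terms):

-- AND terms
(
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'andterm1'    AND
            Jobs_Description_FTI.id     = Jobs.oid
    INTERSECT
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'andterm2'    AND
            Jobs_Description_FTI.id     = Jobs.oid
)
UNION
-- OR terms
(
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'orterm1'    AND
            Jobs_Description_FTI.id     = Jobs.oid
    UNION
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'orterm2'    AND
            Jobs_Description_FTI.id     = Jobs.oid
)
EXCEPT
-- NOT terms
SELECT  Jobs.*
FROM    Jobs, Jobs_Description_FTI
WHERE   Jobs_Description_FTI.string = 'notterm1'    AND
        Jobs_Description_FTI.id     = Jobs.oid;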

In some cases, this has yielded a significant improvement in performance, as
Tom suggested (thanks for that, it was much appreciated). Sometimes, however,
things go the other way.

To cut a long story short, I seem to have tracked the problem down to one
specific situation.

If there are, say, 10K records in the master table, there are about 4M records
in the lookup table. This in itself isn't an issue. Queries that return small
numbers of records, e.g.

SELECT    count(*)
FROM    Jobs_Description_FTI
WHERE    string = 'linux'

(returns ~300 rows)

happen more or less instantaneously.

However, a very similar query such as:

SELECT    count(*)
FROM    Jobs_Description_FTI
WHERE    string = 'nt'

(returns ~30K rows)

takes around two to three minutes.
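
In case it helps with diagnosing this, the plan for the slow case can be
checked with nothing more than:

EXPLAIN
SELECT  count(*)
FROM    Jobs_Description_FTI
WHERE   string = 'nt';

which should at least show whether the (string, id) index is being used at all
for a value as common as 'nt'.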

I tried doing a

SELECT    count(*)
FROM    Jobs
WHERE    Description ILIKE '%nt%'

(returns 11K records out of 12K)

and that only takes about 10 seconds or so.

SELECT    count(*)
FROM    Jobs
WHERE    Description ILIKE '% nt %'

returns ~800 records out of 12K, which is much more reasonable.

Ideally, that should be

SELECT    count(*)
FROM    Jobs
WHERE    Description ~* '.*[^a-z]nt[^a-z].*'

or something like that, which yields a similar number of records to the
previous query, but is slower.

I am fully aware that this is fairly normal under the circumstances, but I
need a way of working around this performance issue. The only ways I can see
of doing that at the moment are to:

1) Modify the FTI function to split the text field only at non-alphanumeric
characters, and only return whole words, rather than substrings of words.

2) Allow the insert into the master table to succeed even if some of the
inserts driven by the trigger fail, and define a unique string-oid index, which
would prevent duplicates and thus yield a smaller lookup table (see the sketch
just after this list).
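
One way to get the effect of option 2 without letting anything actually fail
would be for the trigger to skip duplicates, which per word boils down to
something like this (the word and oid values are placeholders, of course):

-- Insert the (word, oid) pair only if it is not already in the lookup
-- table, so a repeated word does not trip over the unique index.
INSERT INTO Jobs_Description_FTI (string, id)
SELECT  'linux', 12345::oid
WHERE   NOT EXISTS (
            SELECT  1
            FROM    Jobs_Description_FTI
            WHERE   string = 'linux'    AND
                    id     = 12345::oid
        );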

One of the other things I'm considering is pruning the lookup table
duplicates periodically to shrink the table to a more reasonable size.
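
The pruning itself could be as crude as rebuilding the lookup table from its
distinct rows every so often, roughly along these lines (the intermediate
table and index names are made up):

-- Copy only the distinct (string, id) pairs into a fresh table
SELECT  DISTINCT string, id
INTO    Jobs_Description_FTI_clean
FROM    Jobs_Description_FTI;

-- Swap the tables and rebuild the index on the deduplicated data
DROP TABLE Jobs_Description_FTI;
ALTER TABLE Jobs_Description_FTI_clean RENAME TO Jobs_Description_FTI;
CREATE INDEX jobs_description_fti_string_id
    ON Jobs_Description_FTI (string, id);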

If anyone can suggest other courses of action, I am most interested to hear
them. Is there anything in the pipeline for addressing FTI in the next
version of PostgreSQL?

At the moment, the best average case scenario for my application seems to be
just doing an ILIKE or a ~* search, because although it takes a while, it
takes a comparatively similar amount of time for most queries, unlike the FTI
search which can go away for minutes at a time.

Is this another case of my being thick and producing broken SQL? Can anybody
think of a different way of doing this that would yield a performance
increase? I don't want to believe that doing a ~* unindexed sequential search
is the best solution here...

Thanks.

Gordan
