Re: FTI Queries and Explain (long) - Mailing list pgsql-general

From:           Gordan Bobic
Subject:        Re: FTI Queries and Explain (long)
Msg-id:         200110181602.f9IG2uT01099@sentinel.bobich.net
In response to: Re: FTI Queries and Explain (Tom Lane <tgl@sss.pgh.pa.us>)
List:           pgsql-general
[Broken SQL instead of performance issue fixed]

It would appear that when I define the index on the FTI table (string and oid) as unique (which makes sense, since there is little point in having duplicate rows in this case), a lot of inserts fail where they shouldn't. I am guessing that if the insert into the FTI look-up table fails, the insert into the master table fails as well. I can understand that this might be useful for matches where the number of occurrences is important, but in this particular application that is not the case. Before I go and look into modifying the fti function code for my specific purpose, it would be nice to have confirmation of this behaviour - otherwise it may take me a while to find what I'm looking for. ;-)

Another question - there are (as often happens) multiple ways of doing what I want in SQL, but I am not sure which is the fastest and most efficient way of doing it (in theory at least). I want to do a multi-criterion search on the same field (the FTI-indexed one), with separate AND, NOT and OR search terms:

AND = "terms that must all occur in the text"
OR  = "terms of which at least one has to occur in the text"
NOT = "terms which must not occur in the text"

Initially, before FTI, I used a big ILIKE query which worked reasonably well. I should point out that my test bed machine for this is a Pentium 100 MHz with 128 MB of RAM and an IDE disk. My database is expected to be around 50K-100K records, and about 100-200 MB on disk in PostgreSQL files (that's what the disk consumption of the vacuumed database was before FTI). Using the same example data set as before, the query was something like:

SELECT *
FROM Jobs
WHERE
(
    Description ILIKE '%AndTerm1%' AND
    Description ILIKE '%AndTerm2%' AND
    Description ILIKE '%AndTerm3%' ...
) AND (
    Description ILIKE '%OrTerm1%' OR
    Description ILIKE '%OrTerm2%' OR
    Description ILIKE '%OrTerm3%' ...
) AND (
    Description NOT ILIKE '%NotTerm1%' AND
    Description NOT ILIKE '%NotTerm2%' AND
    Description NOT ILIKE '%NotTerm3%' ...
)

This usually returned the required data within 30 seconds or so, after, obviously, doing a sequential search through the database due to the non-anchored ILIKE match.

After implementing FTI, the insertion speed has gone through the floor (as expected), but the select speed doesn't seem to be that much greater, even when using the (string, oid) index on the FTI look-up table. On simple queries that only require one or two terms there is a big speed improvement, but on queries with three or more terms the improvement is not that great. The queries typically return 10-200 rows (give or take a bit, depending on the specific query).

The queries I am using at the moment to replace the above ILIKE solution are of the form:

SELECT Jobs.*
FROM Jobs, Jobs_Description_FTI
WHERE Jobs_Description_FTI.string = $And/Or/NotTerms[$i]
  AND Jobs_Description_FTI.id = Jobs.oid

The AND queries are INTERSECTed together, the OR queries are UNIONed together, both results are UNIONed, and then the NOT queries are EXCEPTed. In some cases this has yielded a significant improvement in performance, as Tom suggested (thanks for that, it was much appreciated). Sometimes, however, things go the other way.

To cut a long story short, I seem to have tracked the problem down to a certain situation. If there are, say, 10K records in the master table, there are about 4M records in the look-up table. This in itself isn't an issue. Queries that return small numbers of records, e.g.

SELECT count(*) FROM Jobs_Description_FTI WHERE string = 'linux'

(returns ~300 rows) happen more or less instantaneously. However, a very similar query such as:

SELECT count(*) FROM Jobs_Description_FTI WHERE string = 'nt'

(returns ~30K rows) takes around two to three minutes.
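For what it's worth, the set-operation composition described a few paragraphs up, taken with one AND term, two OR terms and one NOT term, would look roughly like this (a sketch only - the literal lowercase terms are placeholders standing in for $And/Or/NotTerms[$i], and the combination follows the order described above):

```sql
-- AND block (INTERSECTed), combined by UNION with the OR block
-- (UNIONed), then the NOT terms removed with EXCEPT.
SELECT Jobs.* FROM Jobs, Jobs_Description_FTI
 WHERE Jobs_Description_FTI.string = 'andterm1'
   AND Jobs_Description_FTI.id = Jobs.oid
UNION
(SELECT Jobs.* FROM Jobs, Jobs_Description_FTI
  WHERE Jobs_Description_FTI.string = 'orterm1'
    AND Jobs_Description_FTI.id = Jobs.oid
 UNION
 SELECT Jobs.* FROM Jobs, Jobs_Description_FTI
  WHERE Jobs_Description_FTI.string = 'orterm2'
    AND Jobs_Description_FTI.id = Jobs.oid)
EXCEPT
SELECT Jobs.* FROM Jobs, Jobs_Description_FTI
 WHERE Jobs_Description_FTI.string = 'notterm1'
   AND Jobs_Description_FTI.id = Jobs.oid;
```

(With more than one AND term, each additional AND subquery is INTERSECTed onto the first block before the UNION.)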
I tried doing

SELECT count(*) FROM Jobs WHERE Description ILIKE '%nt%'

(returns 11K records out of 12K) and that only takes about 10 seconds or so.

SELECT count(*) FROM Jobs WHERE Description ILIKE '% nt %'

returns ~800 records out of 12K, which is much more reasonable. Ideally, that should be

SELECT count(*) FROM Jobs WHERE Description ~* '.*[^a-z]nt[^a-z].*'

or something like that, which yields a similar number of records to the previous query, but is slower.

I am fully aware that this is fairly normal under the circumstances, but I need a way of defeating this performance issue. The only ways of doing that that I can see at the moment are to:

1) Modify the FTI function to split the text field only at non-alphanumeric characters, and only return whole words, rather than substrings of words.

2) Allow the insert into the master table to succeed even if some of the inserts driven by the trigger fail, and define a unique (string, oid) index, which would prevent duplicates, thus yielding a smaller look-up table.

One of the other things I'm considering is pruning the look-up table duplicates periodically, to shrink the table to a more reasonable size. If anyone can suggest other courses of action, I am most interested to hear them. Is there anything in the pipeline for addressing FTI in the next version of PostgreSQL?

At the moment, the best average-case scenario for my application seems to be just doing an ILIKE or a ~* search, because although it takes a while, it takes a comparatively similar amount of time for most queries, unlike the FTI search, which can go away for minutes at a time.

Is this another case of my being thick and producing broken SQL? Can anybody think of a different way of doing this that would yield a performance increase? I don't want to believe that doing a ~* unindexed sequential search is the best solution here...

Thanks.

Gordan
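On the pruning idea, one way to do it offline would be to rebuild the look-up table with the duplicates collapsed (a sketch, not tested here - the _new table name and the index name are made up for illustration; the trigger refers to the look-up table by name, so the rename should leave it working):

```sql
-- Build a duplicate-free copy of the look-up table...
SELECT DISTINCT string, id
  INTO Jobs_Description_FTI_new
  FROM Jobs_Description_FTI;

-- ...then swap it in and recreate the index.
DROP TABLE Jobs_Description_FTI;
ALTER TABLE Jobs_Description_FTI_new RENAME TO Jobs_Description_FTI;
CREATE INDEX jobs_desc_fti_idx ON Jobs_Description_FTI (string, id);
```

Run during a quiet period, since inserts into the master table will fail between the DROP and the RENAME.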