Thread: Full Text Index disk space requirements
So, I'm trying to create a full text index as described here:
http://techdocs.postgresql.org/techdocs/fulltextindexing.php

Everything was going mostly okay... I had to hack a quick PHP script
instead of using the Perl one, since I didn't have a working Pg.pm, but
that was a minor speed bump. Then I hit a real road-block:

    \copy article_fti from fulltext.sorted
    \.
    ERROR:  copy: line 34635390, cannot extend article_fti: No space left
    on device. Check free disk space.
    PQendcopy: resetting connection
    archive=> \q
    [root@rm-004-24 utilities]# df -h
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda3              15G   15G     0 100% /
    /dev/sda1              48M  6.1M   39M  14% /boot
    none                  439M     0  439M   0% /dev/shm

Oh. Yeah. I guess that *IS* going to be kind of big...

Any SWAGs how much disk space is required for a 500 M fulltext.sorted
file? I.e., the ASCII file of string/OID pairs, in tab-delimited form, is
500 M -- how much PostgreSQL space does that turn into with the
tables/indices as described at the URL above? When I deleted all the fti
rows and did a VACUUM, there was almost 2G available...

ALSO: Wouldn't using f1.string = 'perth' be faster than f1.string ~
'^perth' and equally useful? Or is ~ with ^ somehow actually faster than
the seemingly simple = comparison?

AND: Would using OR for the individual word comparisons be a big drag on
speed? I'd kind of like to give ranked results based on how many of the
terms were present rather than a complete filter.

I'd be happy to try the EXPLAIN queries, but I don't think they're going
to be accurate without the data in the FTI table... I got some weird
results when I did a test run with a very small dataset in the FTI table
-- but I also think I was doing it in the middle of a train-wreck between
daily VACUUM and pg_dump, which were thrashing each other with all the FTI
data I had imported just for the small test... I've altered the cron jobs
to have more time in between.

THANKS IN ADVANCE!
Hi,

If memory serves me right, the space requirements for this would be
something like:

               42  (per-tuple overhead)
                4  (size of OID)
             + 16  (substitute the maximum length of any 'string' in your
                    fulltext.sorted)
    -------------
               62  bytes per tuple

     x 20,000,000  (substitute the number of lines in fulltext.sorted,
                    i.e. 'wc -l fulltext.sorted')
    -------------
    1,240,000,000

or about 1.2G?

On 11/26/2002 01:36:59 PM typea wrote:
> Wouldn't using f1.string = 'perth' be faster than f1.string ~ '^perth'
> and equally useful? Or is ~ with ^ somehow actually faster than the
> seemingly simple = comparison?

f1.string = 'perth' would only match 'perth', while f1.string ~ '^perth'
would also match 'perthinent' (yes, I know this word does not exist).

Maarten

ps. Are you trying to use the stuff from the 'fulltextindex' directory in
contrib/? I originally wrote this as an experiment, and it actually turned
out not to be fast enough for my purpose. I've never done anything with
full text indexing again, but I believe that currently there are better
solutions based on PostgreSQL (i.e. OpenFTI?)
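Once the load finishes (and after a VACUUM, which is what refreshes the
statistics), the actual on-disk size can be checked from the system
catalog rather than estimated. A minimal sketch, assuming the table and
index names from the contrib/fulltextindex setup and the default 8 kB
block size:

```sql
-- relpages is the table/index size in 8 kB blocks, as of the last VACUUM.
SELECT relname, relpages, relpages * 8192 AS approx_bytes
FROM pg_class
WHERE relname LIKE 'article_fti%'
ORDER BY relpages DESC;
```

This counts the heap and each index separately, which is useful here since
the two indexes on article_fti can easily rival the table itself in size.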
> ps. Are you trying to use the stuff from the 'fulltextindex' directory
> in contrib/?

Yes.

> I originally wrote this as an experiment, and it actually turned out not
> to be fast enough for my purpose. I've never done anything with full
> text indexing again, but I believe that currently there are better
> solutions based on PostgreSQL (i.e. OpenFTI?)

Oh. ...

In case anybody finds these archived, it's OpenFTS:
http://sourceforge.net/projects/openfts/

Perhaps my question should be "What's the best full-text-index solution?"
Too open-ended?

PostgreSQL 7.1.3 (upgrading is not out of the question, but...)
~20,000 text articles scanned with OCR from _The Bulletin of the Atomic
Scientists_ (the Doomsday Clock folks)
Average text length: 9635 characters
Max text length: 278227 characters
Only 2000 of the texts are null or '', and those are probably "buglets"

Any other pertinent facts needed?
typea@l-i-e.com wrote on Wed, 27.11.2002 at 01:13:
> > ps. Are you trying to use the stuff from the 'fulltextindex' directory
> > in contrib/?
>
> Yes.
>
> > I originally wrote this as an experiment, and it actually turned out
> > not to be fast enough for my purpose. I've never done anything with
> > full text indexing again, but I believe that currently there are
> > better solutions based on PostgreSQL (i.e. OpenFTI?)
>
> Oh. ...
>
> In case anybody finds these archived, it's OpenFTS:
> http://sourceforge.net/projects/openfts/
>
> Perhaps my question should be "What's the best full-text-index
> solution?"

You should at least check the possibilities of using contrib/tsearch and
contrib/intarray.

If you find out some good answers, report back to this list :)

--------------
Hannu
>> Wouldn't using f1.string = 'perth' be faster than f1.string ~ '^perth'
>> and equally useful? Or is ~ with ^ somehow actually faster than the
>> seemingly simple = comparison?
>
> f1.string = 'perth' would only match 'perth', while f1.string ~ '^perth'
> would also match 'perthinent' (yes, I know this word does not exist).

D'oh! I figured that one out in the shower this morning. Sleep
deprivation, I guess...

But something is very wrong with what I've done...

archive=> explain SELECT article.* FROM article, article_fti AS f1,
          article_fti AS f2 WHERE TRUE AND (TRUE AND (f1.string ~
          '^nuclear' AND f1.id = article.oid) AND (f2.string ~ '^winter'
          AND f2.id = article.oid));
NOTICE:  QUERY PLAN:

Merge Join  (cost=1476541.78..1492435.98 rows=77581 width=228)
  ->  Merge Join  (cost=740017.07..744846.55 rows=368824 width=224)
        ->  Sort  (cost=3492.36..3492.36 rows=17534 width=220)
              ->  Seq Scan on article  (cost=0.00..1067.34 rows=17534 width=220)
        ->  Sort  (cost=736524.71..736524.71 rows=368824 width=4)
              ->  Seq Scan on article_fti f2  (cost=0.00..693812.18 rows=368824 width=4)
  ->  Sort  (cost=736524.71..736524.71 rows=368824 width=4)
        ->  Seq Scan on article_fti f1  (cost=0.00..693812.18 rows=368824 width=4)

EXPLAIN

archive=> explain select * from article where text like '%nuclear%' and
          text like '%winter%';
NOTICE:  QUERY PLAN:

Seq Scan on article  (cost=0.00..1155.01 rows=1 width=216)

EXPLAIN

archive=> \d article_fti
Table "article_fti"
 Attribute | Type | Modifier
-----------+------+----------
 string    | text |
 id        | oid  |
Indices: article_fti_id_index,
         article_fti_string_index

archive=> \d article
Table "article"
     Attribute     |  Type   |                   Modifier
-------------------+---------+----------------------------------------------
 id                | integer | not null default nextval('article_ID'::text)
 ...
 text              | text    |
Indices: article_id_index,
         article_oid_index,
         article_type_index

archive=>

I'm befuddled.
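One possible explanation for those Seq Scans on article_fti: in the 7.1
era the planner can only turn an anchored regexp like ~ '^nuclear' into an
index range scan when the database locale is C. A hedged workaround
sketch, spelling the prefix match out as an explicit btree range the
planner can drive through article_fti_string_index regardless of locale
(the upper bounds are just the prefix with its last character
incremented):

```sql
-- Sketch: prefix matches as explicit ranges instead of ~ '^...'
SELECT article.*
FROM article, article_fti AS f1, article_fti AS f2
WHERE f1.string >= 'nuclear' AND f1.string < 'nucleas'  -- 'nuclear' prefix
  AND f1.id = article.oid
  AND f2.string >= 'winter'  AND f2.string < 'wintes'   -- 'winter' prefix
  AND f2.id = article.oid;
```

If EXPLAIN then shows Index Scans on article_fti, the locale was the
culprit; if it still shows Seq Scans, a VACUUM ANALYZE on article_fti is
worth trying so the planner sees realistic row counts.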
I looked for a "known bugs" sort of database to search before bugging you
guys, but failed to find it... But I am at least asking before I submit a
new bug report :-)

In version 7.1.3 on a Linux box:

A particularly long, nasty query works "just fine" (returning seemingly
correct results in about 15 seconds) until I tack on "LIMIT 1". Adding
LIMIT 1, however, seems to make the query take an infinite amount of time.
Well, more than 5 minutes, anyway, and I'm not that patient when I know it
worked okay without the LIMIT, if you know what I mean.

Here is the query:

SELECT DISTINCT *, 0
  + 10 * (lower(title) like '%albert einstein%')::int
  + 10 * (lower(author_flattened) like '%albert einstein%')::int
  + 30 * (lower(subject_flattened) like '%albert einstein%')::int
  +  9 * (substring(lower(title), 1, 20) like '%albert%')::int
  + 25 * (substring(lower(text), 1, 20) LIKE '%albert%')::int
  + (8 * (lower(title) LIKE '%albert%' AND lower(title) LIKE '%einstein%'
          AND ((title ~* 'albert.{0,20}einstein')
            OR (title ~* 'einstein.{0,20}albert')))::int)
  + (1 * ((lower(title) LIKE '%albert%'))::int)
  + (1 * ((lower(author_flattened) LIKE '%albert%'))::int)
  + (1 * ((lower(subject_flattened) LIKE '%albert%'))::int)
  +  9 * (substring(lower(title), 1, 20) like '%einstein%')::int
  + 25 * (substring(lower(text), 1, 20) LIKE '%einstein%')::int
  + (8 * (lower(title) LIKE '%einstein%' AND lower(title) LIKE '%albert%'
          AND ((title ~* 'einstein.{0,20}albert')
            OR (title ~* 'albert.{0,20}einstein')))::int)
  + (1 * ((lower(title) LIKE '%einstein%'))::int)
  + (1 * ((lower(author_flattened) LIKE '%einstein%'))::int)
  + (1 * ((lower(subject_flattened) LIKE '%einstein%'))::int)
  AS points
FROM article
WHERE FALSE
   OR (lower(title) LIKE '%albert%')
   OR (lower(author_flattened) LIKE '%albert%')
   OR (lower(subject_flattened) LIKE '%albert%')
   OR (lower(title) LIKE '%einstein%')
   OR (lower(author_flattened) LIKE '%einstein%')
   OR (lower(subject_flattened) LIKE '%einstein%')
ORDER BY points desc, volume, number, article.article
LIMIT 1, 1;

EXPLAIN with or without the LIMIT part is about what you'd expect:

Limit  (cost=1596.50..1596.50 rows=1 width=216)
  ->  Unique  (cost=1596.45..1596.50 rows=1 width=216)
        ->  Sort  (cost=1596.45..1596.45 rows=1 width=216)
              ->  Seq Scan on article  (cost=0.00..1596.44 rows=1 width=216)

Obviously the "Limit" line is gone from the EXPLAIN output when there is
no LIMIT, but the other lines are all the same.

Is this a known bug? Is there a fix or work-around? If not, should I
report it, or will the first answer be "Upgrade."?

The table in question has 17,000 records, and the various fields mentioned
here are all rather short -- just author names, subject lines, and titles
of text articles. [The articles themselves are super long, but are not
involved in this query.]

I can take out the ~* parts, and life is good again, so almost for sure
that's a critical component in the failure.

ps auxwwww | grep postgres seems to report an "idle" postgres process for
each failed query -- attempting to ^C the query and/or killing the idle
process (I know, "Don't") is unfruitful. kill -9 does nuke the idle
processes, IIRC, but I'm not 100% sure... I restarted the server soon
after that, since (A) PHP command-line (aka "CGI") was refusing to start,
complaining about "mm" not being loadable, and there was not much free
RAM, and the web-server was not particularly happy about that state of
affairs...
The schema is probably not particularly interesting -- pretty much every
field involved is a 'text' field, but here you go:

Table "article"
     Attribute     |  Type   |                   Modifier
-------------------+---------+----------------------------------------------
 id                | integer | not null default nextval('article_ID'::text)
 volume            | text    |
 number            | text    |
 article           | text    |
 date              | text    |
 cover_date        | text    |
 title             | text    |
 author            | text    |
 author_last       | text    |
 author_first      | text    |
 subject           | text    |
 pages             | text    |
 artwork           | text    |
 text              | text    |
 type              | integer |
 type_hardcoded    | text    |
 type_detailed     | integer |
 abstract          | text    |
 subject_flattened | text    |
 author_flattened  | text    |
Indices: article_id_index,
         article_oid_index,
         article_type_index

Just FYI, the _flattened fields are de-normalizing (or is it
re-normalizing?) some relation tables so that we're not making a zillion
tuples here, and it's just a simple (we thought) short and sweet text
search.

PS Thanks for all your help on the full text index! I'm still evaluating
some options, but a home-brew concordance is showing the most promise.
I'll post source/details if it works out.
typea,

> I looked for a "known bugs" sort of database to search before bugging
> you guys, but failed to find it... But I am at least asking before I
> submit a new bug report :-)
>
> In version 7.1.3 on a Linux box:

You'll get a snarky response, and then be told to upgrade, if you try to
submit a bug in 7.1.3.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> You'll get a snarky response, and then be told to upgrade, if you try to
> submit a bug in 7.1.3.

7.1 is a tad long in the tooth, but still I'm curious about this. I don't
see how <plan A> can possibly take longer than <plan A> + <LIMIT node on
top>.

regards, tom lane
> Josh Berkus <josh@agliodbs.com> writes:
>> You'll get a snarky response, and then be told to upgrade, if you try
>> to submit a bug in 7.1.3.
>
> 7.1 is a tad long in the tooth, but still I'm curious about this. I
> don't see how <plan A> can possibly take longer than <plan A> + <LIMIT
> node on top>.

Hey Tom. I think we met very briefly at the International PHP Conference
in Frankfurt in 2001... Anyway.

It's actually the other way around. <Plan A> takes like 4 seconds.
<Plan A> + <LIMIT node on top> takes literally FOREVER and leaves a
postgres process hanging 'round that I have to kill -9 to get rid of.

I'd understand the LIMIT clause taking a bit longer, or being faster for
startup (if there were no ORDER BY, which there is), but I never even
considered it would hang the whole thing. Actually, PostgreSQL has been so
reliable over the years, the idea that I'd run across a bug was just
foreign to me... So I've been trying to tune performance on this query for
weeks now, not realizing that the speed wasn't the issue at all. I could
almost rip out the LIMIT completely if the application logic let me, and
if the performance were a bit better.

It occurred to me last night that the actual data *MIGHT* be involved --
it's some OCR text, and there are a few scattered non-ASCII characters
involved... So *MAYBE* the actual text getting scanned could also be
important. It seems unlikely, since the non-LIMIT query returns all the
data just fine, but just in case...

Here's a schema and a full dump for anybody that wants to dig in:
http://bulletinarchive.org/pg_dump/

I could provide PHP source as well, or the query posted in this thread can
serve as the test case.
At the moment, I've altered the application to not use LIMIT when I have
~* in the query, and it "works" -- only the paging of results is broken,
and the whole page takes twice as long to load as it should in those
cases, since it's doing the same query twice and snarfing all the monster
data and then throwing away the majority of rows in both cases. I need
the first row to get the highest score, and the rows for paging in the
real application...

Anyway, my point is that the queries seem fine without the LIMIT clause,
and "hang" with both "~*" and LIMIT, and I've even gone so far as to
incorporate that into the application logic for now, just to have a page
that loads at all instead of one that hangs.

Meanwhile, I guess I should flail at it and try 7.3 in the hopes the bug
disappeared. I was hoping to know for sure that it was a fixed bug in
that upgrade path.

Boss actually said we should go ahead and upgrade just on principle
anyway. It's nice to have a smart boss. :-)
Typea, > At the moment, I've altered the application to not use LIMIT when I > have > ~* in the query, and it "works" -- only the paging of results is > broken, Would your application allow you to use " ILIKE '%<VALUE>%'" in the query instead of "~*" ? If so, does the query still hang with ILIKE ... LIMIT? -Josh
typea@l-i-e.com wrote on Sun, 15.12.2002 at 05:41:
> It occurred to me last night that the actual data *MIGHT* be involved --
> It's some OCR text, and there are a few scattered non-ASCII characters
> involved... So *MAYBE* the actual text getting scanned could also be
> important.
>
> It seems unlikely, since the non-LIMIT query returns all the data just
> fine, but just in case...

Have you tried using DECLARE CURSOR...; FETCH 1; CLOSE CURSOR; instead of
LIMIT?

> Here's a schema and a full dump for anybody that wants to dig in:
> http://bulletinarchive.org/pg_dump/

gzipping the data could make sense - data.sql goes from 200M to 60M ;)

> I could provide PHP source as well, or the query posted in this thread
> can serve as the test case.
>
> At the moment, I've altered the application to not use LIMIT when I have
> ~* in the query, and it "works" -- only the paging of results is broken,
> and the whole page takes twice as long to load as it should in those
> cases, since it's doing the same query twice and snarfing all the
> monster data and then throwing away the majority of rows in both cases.
> I need the first row to get the highest score, and the rows for paging
> in the real application...
>
> Anyway, my point is that the queries seem fine without the LIMIT clause,
> and "hang" with both "~*" and LIMIT, and I've even gone so far as to
> incorporate that into the application logic for now, just to have a page
> that loads at all instead of one that hangs.
>
> Meanwhile, I guess I should flail at it and try 7.3 in the hopes the bug
> disappeared.
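Spelled out, the cursor workaround looks like the sketch below (cursor
name and WHERE clause are illustrative; any of the queries in this thread
would slot in). The point is that the problematic LIMIT node never enters
the plan, yet only the fetched rows cross the wire:

```sql
-- Cursors must live inside a transaction block in PostgreSQL.
BEGIN;

DECLARE top_hit CURSOR FOR
  SELECT * FROM article
  WHERE title ILIKE '%einstein%'
  ORDER BY volume, number, article;

FETCH 1 FROM top_hit;   -- first row only; FETCH 10 would page further

CLOSE top_hit;
COMMIT;
```

For the paging case this also avoids running the query twice: fetch one
row for the top score, then keep fetching from the same cursor for the
current page.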
I tested (part of) it on 7.3; I had to manually change ::int to
CASE-WHEN-THEN-ELSE-END as there is no cast from bool to int in 7.3.

This ran fine:

SELECT DISTINCT *, 0
  + CASE WHEN (title ILIKE '%albert einstein%') THEN 10 ELSE 0 END
  + CASE WHEN (title ILIKE '%einstein%' AND title ILIKE '%albert%'
               AND ((title ~* 'einstein.{0,20}albert')
                 OR (title ~* 'albert.{0,20}einstein')))
         THEN 8 ELSE 0 END AS points
FROM article
WHERE FALSE
   OR (title ILIKE '%albert%')
   OR (author_flattened ILIKE '%albert%')
   OR (subject_flattened ILIKE '%albert%')
   OR (title ILIKE '%einstein%')
   OR (author_flattened ILIKE '%einstein%')
   OR (subject_flattened ILIKE '%einstein%')
ORDER BY points DESC, volume, number, article.article
LIMIT 1 OFFSET 1;

I also changed "lower(field) like '%albert%'" to "field ilike '%albert%'"
and got about a 20% speed boost - EXPLAIN ANALYZE reported 0.189 instead
of 0.263 sec as actual time.

> I was hoping to know for sure that it was a fixed bug in that upgrade
> path.
>
> Boss actually said we should go ahead and upgrade just on principle
> anyway. It's nice to have a smart boss. :-)

--
Hannu Krosing <hannu@tm.ee>
> Have you tried using DECLARE CURSOR...; FETCH 1; CLOSE CURSOR; instead
> of LIMIT?

I think I did, in the monitor, and it worked fine.

> I tested (part of) it on 7.3; I had to manually change ::int to
> CASE-WHEN-THEN-ELSE-END as there is no cast from bool to int in 7.3.

An upgrade to 7.3 has, in fact, gotten rid of that bug... Though now I'm
forced to use localhost for connecting, since:

A) Upon boot, I'm told I can't use password or crypt, but
B) When trying to connect, I can't use md5
C) The passwords get turned into md5 whether I like it or not

What's up with all that?

I also don't understand why the incredibly useful function I had to
auto-typecast from bool to int won't work using ::int syntax, but will if
I use int4(...) syntax. Grrr. And breaking the LIMIT x, y thing was
annoying. Oh well. I can move forward with some changes in the way we do
things. Now that the query runs, I can start in on the optimization again
:-)

THANKS ALL!!!

Oh, and the lower(field) LIKE is MySQL compatible, but I don't think MySQL
has an ILIKE... We're abandoning the MySQL support now anyway, since we
NEED performance way more than we need MySQL compatibility.

Thanks again!
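For anyone hitting the same two 7.3 migration snags, a minimal sketch of
the rewrites involved (grounded in what worked above; the int4() form is
the one reported working, the LIMIT/OFFSET spelling is the one 7.3
accepts):

```sql
-- 7.3 dropped the MySQL-style "LIMIT x, y"; the portable spelling is:
SELECT * FROM article
ORDER BY volume, number, article
LIMIT 1 OFFSET 1;      -- was: LIMIT 1, 1

-- 7.3 has no implicit bool -> int cast, so '::int' on a boolean fails,
-- but the function-call form still works:
SELECT 10 * int4(title ILIKE '%einstein%') AS points
FROM article;          -- was: 10 * (title ILIKE '%einstein%')::int
```

The md5-vs-crypt confusion is likely a pg_hba.conf issue: 7.3 stores
passwords as md5 hashes by default, and an md5-hashed password can no
longer be verified by the older "password" or "crypt" auth methods, so the
pg_hba.conf lines for non-local connections need to say md5 as well.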
>> I tested (part of) it on 7.3; I had to manually change ::int to
>> CASE-WHEN-THEN-ELSE-END as there is no cast from bool to int in 7.3.
>
> An upgrade to 7.3 has, in fact, gotten rid of that bug...

Damn. I spoke too soon. It *SEEMS* like it's back again. Very, very
strange.

If EXPLAIN claims the "cost" will be ~1000, and then a query takes SO long
to return that I give up and hit ^C, that's just not right, right? I
mean, that "cost" may not be in seconds or anything, but 1000 is pretty
low, isn't it?

I give up for now. Need sleep.
Typea,

> Oh, and the lower(field) LIKE is MySQL compatible, but I don't think
> MySQL has an ILIKE... We're abandoning the MySQL support now anyway,
> since we NEED performance way more than we need MySQL compatibility.

ILIKE is SQL-spec. There are reasons to use each:

- ILIKE is slightly faster on un-anchored text searches
  ("name ILIKE '%john%'")
- lower(column) can be indexed for anchored text searches
  ("lower(name) LIKE 'john%'")
- "~*" cannot be indexed, but will accept regexp operators for
  sophisticated text searches ("name ~* 'jo[han]n?'")

-Josh Berkus
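The indexable middle case above can be sketched as follows (index name is
illustrative; 7.x supports this kind of single-function functional index,
though as with plain text indexes, LIKE can only use it for left-anchored
patterns, and in pre-7.4 releases only under the C locale):

```sql
-- A functional index serving anchored case-insensitive searches.
CREATE INDEX article_lower_title_idx ON article (lower(title));

-- Left-anchored: can use article_lower_title_idx.
SELECT id FROM article WHERE lower(title) LIKE 'einstein%';

-- Un-anchored: always a sequential scan, so ILIKE is the simpler spelling.
SELECT id FROM article WHERE title ILIKE '%einstein%';
```

EXPLAIN on the two SELECTs shows which form the planner can actually turn
into an index scan on a given setup.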
"Josh Berkus" <josh@agliodbs.com> writes: > ILIKE is SQL-spec. It is? I don't see it in there ... regards, tom lane