Thread: Full Text Index disk space requirements

Full Text Index disk space requirements

From
Date:
So, I'm trying to create a full text index as described here:

http://techdocs.postgresql.org/techdocs/fulltextindexing.php

Everything was going mostly okay...

I had to hack a quick PHP script instead of using the Perl one, since I
didn't have a working Pg.pm, but that was a minor speed bump.

Then I hit a real road-block...

\copy article_fti from fulltext.sorted
\.
ERROR:  copy: line 34635390, cannot extend article_fti: No space left on
device.
        Check free disk space.
PQendcopy: resetting connection
archive=> \q
[root@rm-004-24 utilities]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              15G   15G     0 100% /
/dev/sda1              48M  6.1M   39M  14% /boot
none                  439M     0  439M   0% /dev/shm

Oh.  Yeah.  I guess that *IS* going to be kind of big...

Any SWAGs how much disk space is required for a 500 M fulltext.sorted file?

I.e., the ASCII file of string/OID pairs, in tab-delimited form, is 500 M --
how much PostgreSQL space does that turn into with the tables/indices as
described at the URL above?

When I deleted all the fti rows, and did a VACUUM, there was almost 2G
available...

ALSO:
Wouldn't using f1.string = 'perth' be faster than f1.string ~ '^perth' and
equally useful?  Or is ~ with ^ somehow actually faster than the seemingly
simple = comparison?

AND:
Would using OR for the individual word comparisons be a big drag on speed?
 I'd kind of like to give ranked results based on how many of the terms
were present rather than a complete filter.
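
For what it's worth, here is a sketch of what a ranked OR search might look
like, assuming the article_fti(string, id) layout from the article above
(untested; counts may be inflated if the fti table stores more than one row
per source word):

```sql
-- One point per matching fti row; ORDER BY score gives ranked results
-- instead of the all-terms-required self-join.
SELECT article.oid, count(*) AS score
  FROM article, article_fti f
 WHERE f.id = article.oid
   AND (f.string ~ '^nuclear' OR f.string ~ '^winter')
 GROUP BY article.oid
 ORDER BY score DESC;
```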

I'd be happy to try the EXPLAIN queries, but I don't think they're going
to be accurate without the data in the FTI table...

I got some weird results when I did a test run with a very small dataset
in the FTI table -- But I also think I was doing it in the middle of a
train-wreck between daily VACUUM and pg_dump, which were thrashing each
other with all the FTI data I had imported just for the small test...

I've altered the cron jobs to have more time in between.

THANKS IN ADVANCE!




Re: Full Text Index disk space requirements

From
Maarten Boekhold
Date:
Hi,

if memory serves me right, the space requirements for this would be
something like:

        42   (per-tuple overhead)
      +  4   (size of an OID?)
      + 16   (substitute the maximum length of any 'string' in your
              fulltext.sorted)
      ------
        62   bytes per tuple
      x 20,000,000   (substitute the number of lines in fulltext.sorted,
                      i.e. 'wc -l fulltext.sorted')
      --------------
        1,240,000,000 bytes

or about 1.2G?
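
One way to check such an estimate against reality after loading (or
part-loading) the data -- a sketch, assuming a default build where
relpages is measured in 8 kB blocks:

```sql
VACUUM ANALYZE article_fti;             -- refresh the stats first
SELECT relname, relpages * 8192 AS bytes
  FROM pg_class
 WHERE relname LIKE 'article_fti%';     -- the table and its indices
```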

On 11/26/2002 01:36:59 PM typea wrote:
> Wouldn't using f1.string = 'perth' be faster than f1.string ~ '^perth'
> and equally useful?  Or is ~ with ^ somehow actually faster than the
> seemingly simple = comparison?

f1.string = 'perth' would only match 'perth', while f1.string ~ '^perth'
would also match 'perthinent' (yes, I know this word does not exist).
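
In other words:

```sql
SELECT 'perthinent' =  'perth';   -- f
SELECT 'perthinent' ~  '^perth';  -- t   (prefix match)
SELECT 'perth'      =  'perth';   -- t
SELECT 'perth'      ~  '^perth';  -- t
```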

Maarten

ps. are you trying to use the stuff from the 'fulltextindex' directory in
contrib/? I originally wrote this as an experiment, and it actually turned
out not to be fast enough for my purpose. I've never done anything with
full text indexing again, but I believe that currently there are better
solutions based on PostgreSQL (i.e. OpenFTI?)



Re: Full Text Index disk space requirements

From
Date:
> ps. are you trying to use the stuf from the 'fulltextindex' directory in
>  contrib/?

Yes.

> I originally wrote this as an experiment, and it actually
> turned  out not to be fast enough for my purpose. I've never done
> anything with  full text indexing again, but I believe that currently
> there are better  solutions based on PostgreSQL (i.e. OpenFTI?)

Oh.  ...

In case anybody finds these archived, it's OpenFTS:
http://sourceforge.net/projects/openfts/

Perhaps my question should be "What's the best full-text-index solution?"

Too open-ended?

PostgreSQL 7.1.3 (upgrading is not out of the question, but...)
~20,000 text articles scanned with OCR from _The Bulletin of the Atomic
Scientists_ (the Doomsday Clock folks)
Average text length: 9635 characters
    Max text length: 278227
Only 2000 of the texts are null or '', and those are probably "buglets"

Any other pertinent facts needed?




Re: Full Text Index disk space requirements

From
Hannu Krosing
Date:
typea@l-i-e.com wrote on Wed, 27.11.2002 at 01:13:
> > ps. are you trying to use the stuf from the 'fulltextindex' directory in
> >  contrib/?
>
> Yes.
>
> > I originally wrote this as an experiment, and it actually
> > turned  out not to be fast enough for my purpose. I've never done
> > anything with  full text indexing again, but I believe that currently
> > there are better  solutions based on PostgreSQL (i.e. OpenFTI?)
>
> Oh.  ...
>
> In case anybody finds these archived, it's OpenFTS:
> http://sourceforge.net/projects/openfts/
>
> Perhaps my question should be "What's the best full-text-index solution?"
>

You should at least check possibilities of using

contrib/tsearch

and

contrib/intarray


If you find out some good answers, report back to this list :)

--------------
Hannu


Re: Full Text Index disk space requirements

From
Date:
>> Wouldn't using f1.string = 'perth' be faster than f1.string ~ '^perth'
>> and equally useful?  Or is ~ with ^ somehow actually faster than the
>> seemingly simple = comparison?
>
> f1.string = 'perth' would only match 'perth', while f1.string ~ '^perth'
>  would also match 'perthinent' (yes, I know this word does not exist).

D'oh!  I figured that one out in the shower this morning.  Sleep
deprivation, I guess...

But something is very wrong with what I've done...

archive=> explain SELECT article.* FROM article , article_fti as f1,
article_fti as f2  WHERE TRUE  AND (TRUE  AND  (f1.string ~ '^nuclear' AND
f1.id = article.oid )   AND  (f2.string ~ '^winter' AND f2.id =
article.oid )   ) ;
NOTICE:  QUERY PLAN:

Merge Join  (cost=1476541.78..1492435.98 rows=77581 width=228)
  ->  Merge Join  (cost=740017.07..744846.55 rows=368824 width=224)
        ->  Sort  (cost=3492.36..3492.36 rows=17534 width=220)
              ->  Seq Scan on article  (cost=0.00..1067.34 rows=17534 width=220)
        ->  Sort  (cost=736524.71..736524.71 rows=368824 width=4)
              ->  Seq Scan on article_fti f2  (cost=0.00..693812.18 rows=368824 width=4)
  ->  Sort  (cost=736524.71..736524.71 rows=368824 width=4)
        ->  Seq Scan on article_fti f1  (cost=0.00..693812.18 rows=368824 width=4)

EXPLAIN
archive=> explain select * from article where text like '%nuclear%' and
text like '%winter%';
NOTICE:  QUERY PLAN:

Seq Scan on article  (cost=0.00..1155.01 rows=1 width=216)

EXPLAIN
archive=> \d article_fti
     Table "article_fti"
 Attribute | Type | Modifier
-----------+------+----------
 string    | text |
 id        | oid  |
Indices: article_fti_id_index,
         article_fti_string_index

archive=> \d article
                              Table "article"
     Attribute     |  Type   |                   Modifier
-------------------+---------+----------------------------------------------
 id                | integer | not null default nextval('article_ID'::text)
...
 text              | text    |
Indices: article_id_index,
         article_oid_index,
         article_type_index

archive=>

I'm befuddled.




~* + LIMIT => infinite time?

From
Date:
I looked for a "known bugs" sort of database to search before bugging you
guys, but failed to find it...  But I am at least asking before I submit a
new bug report :-)

In version 7.1.3 on a Linux box:

A particularly long, nasty query works "just fine" (returning seemingly
correct results in about 15 seconds) until I tack on "LIMIT 1"

Adding LIMIT 1, however, seems to make the query take an infinite amount
of time.  Well, more than 5 minutes, anyway, and I'm not that patient when
I know it worked okay without the LIMIT, if you know what I mean.

Here is the query:

SELECT DISTINCT *,
  0
  + 10 * (lower(title) LIKE '%albert einstein%')::int
  + 10 * (lower(author_flattened) LIKE '%albert einstein%')::int
  + 30 * (lower(subject_flattened) LIKE '%albert einstein%')::int
  +  9 * (substring(lower(title), 1, 20) LIKE '%albert%')::int
  + 25 * (substring(lower(text), 1, 20) LIKE '%albert%')::int
  + (8 * (lower(title) LIKE '%albert%' AND lower(title) LIKE '%einstein%'
          AND ((title ~* 'albert.{0,20}einstein')
            OR (title ~* 'einstein.{0,20}albert')))::int)
  + (1 * ((lower(title) LIKE '%albert%'))::int)
  + (1 * ((lower(author_flattened) LIKE '%albert%'))::int)
  + (1 * ((lower(subject_flattened) LIKE '%albert%'))::int)
  +  9 * (substring(lower(title), 1, 20) LIKE '%einstein%')::int
  + 25 * (substring(lower(text), 1, 20) LIKE '%einstein%')::int
  + (8 * (lower(title) LIKE '%einstein%' AND lower(title) LIKE '%albert%'
          AND ((title ~* 'einstein.{0,20}albert')
            OR (title ~* 'albert.{0,20}einstein')))::int)
  + (1 * ((lower(title) LIKE '%einstein%'))::int)
  + (1 * ((lower(author_flattened) LIKE '%einstein%'))::int)
  + (1 * ((lower(subject_flattened) LIKE '%einstein%'))::int)
  AS points
FROM article
WHERE FALSE
   OR (lower(title) LIKE '%albert%')
   OR (lower(author_flattened) LIKE '%albert%')
   OR (lower(subject_flattened) LIKE '%albert%')
   OR (lower(title) LIKE '%einstein%')
   OR (lower(author_flattened) LIKE '%einstein%')
   OR (lower(subject_flattened) LIKE '%einstein%')
ORDER BY points DESC, volume, number, article.article
LIMIT 1 , 1;


explain with or without the LIMIT part is about what you'd expect.

Limit  (cost=1596.50..1596.50 rows=1 width=216)
  ->  Unique  (cost=1596.45..1596.50 rows=1 width=216)
        ->  Sort  (cost=1596.45..1596.45 rows=1 width=216)
              ->  Seq Scan on article  (cost=0.00..1596.44 rows=1 width=216)

Obviously the "Limit" line is gone from the explain output when there is
no LIMIT, but the other lines are all the same.

Is this a known bug, is there a fix or work-around?
If not, should I report it, or will the first answer be "Upgrade." ?

The table in question has 17,000 records, and the various fields mentioned
here are all rather short -- Just author names, subject lines, and titles
of text articles.  [The articles themselves are super long, but are not
involved in this query.]

I can take out the ~* parts, and life is good again, so almost for sure
that's a critical component in the failure.

ps auxwwww | grep postgres seems to report an "idle" postgres process for
each failed query -- attempting to ^C the query and/or killing the idle
process (I know, "Don't") is unfruitful.

kill -9 does nuke the idle processes, IIRC, but I'm not 100% sure...

I restarted the server soon after that, since (A) PHP command-line (aka
"CGI") was refusing to start, complaining about "mm" not being loadable,
and (B) there was not much free RAM and the web-server was not particularly
happy about that state of affairs...

The schema is probably not particularly interesting -- Pretty much every
field involved is a 'text' field, but here you go:

                             Table "article"
     Attribute     |  Type   |                   Modifier
-------------------+---------+----------------------------------------------
 id                | integer | not null default nextval('article_ID'::text)
 volume            | text    |
 number            | text    |
 article           | text    |
 date              | text    |
 cover_date        | text    |
 title             | text    |
 author            | text    |
 author_last       | text    |
 author_first      | text    |
 subject           | text    |
 pages             | text    |
 artwork           | text    |
 text              | text    |
 type              | integer |
 type_hardcoded    | text    |
 type_detailed     | integer |
 abstract          | text    |
 subject_flattened | text    |
 author_flattened  | text    |
Indices: article_id_index,
         article_oid_index,
         article_type_index

Just FYI, the _flattened fields are de-normalizing (or is it
re-normalizing?) some relation tables so that we're not making a zillion
tuples here, and it's just a simple (we thought) short and sweet text
search.


PS  Thanks for all your help on the full text index!  I'm still evaluating
some options, but a home-brew concordance is showing the most promise.
I'll post source/details if it works out.




Re: ~* + LIMIT => infinite time?

From
Josh Berkus
Date:
typea,

> I looked for a "known bugs" sort of database to search before bugging you
> guys, but failed to find it...  But I am at least asking before I submit a
> new bug report :-)
>
> In version 7.1.3 on a Linux box:

You'll get a snarky response, and then be told to upgrade, if you try to
submit a bug in 7.1.3.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: ~* + LIMIT => infinite time?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> You'll get a snarky response, and then be told to upgrade, if you try to
> submit a bug in 7.1.3.

7.1 is a tad long in the tooth, but still I'm curious about this.  I
don't see how <plan A> can possibly take longer than <plan A> + <LIMIT
node on top>.

            regards, tom lane

Re: ~* + LIMIT => infinite time?

From
Date:
> Josh Berkus <josh@agliodbs.com> writes:
>> You'll get a snarky response, and then be told to upgrade, if you try
>> to  submit a bug in 7.1.3.
>
> 7.1 is a tad long in the tooth, but still I'm curious about this.  I
> don't see how <plan A> can possibly take longer than <plan A> + <LIMIT
> node on top>.

Hey Tom.  I think we met very briefly at the International PHP Conference
in Frankfurt in 2001...  Anyway.

It's actually the other way around.  <Plan A> takes like 4 seconds.  <Plan
A> + <LIMIT node on top> takes literally FOREVER and leaves a postgres
process hanging 'round that I have to kill -9 to get rid of.

I'd understand the LIMIT clause taking a bit longer, or being faster for
startup (if there were no ORDER BY, which there is) but I never even
considered it would hang the whole thing.  Actually, PostgreSQL has been
so reliable over the years, the idea that I'd run across a bug was just
foreign to me...  So I've been trying to tune performance on this query
for weeks now, not realizing that the speed wasn't the issue at all.  I
could almost rip out the LIMIT completely if the application logic let me,
and if the performance were a bit better.

It occurred to me last night that the actual data *MIGHT* be involved --
It's some OCR text, and there are a few scattered non-ASCII characters
involved...  So *MAYBE* the actual text getting scanned could also be
important.

It seems unlikely, since the non-LIMIT query returns all the data just
fine, but just in case...

Here's a schema and a full dump for anybody that wants to dig in:
http://bulletinarchive.org/pg_dump/

I could provide PHP source as well, or the query posted in this thread can
serve as the test case.

At the moment, I've altered the application to not use LIMIT when I have
~* in the query, and it "works" -- only the paging of results is broken,
and the whole page takes twice as long to load as it should in those
cases, since it's doing the same query twice and snarfing all the monster
data and then throwing away the majority of rows in both cases.  I need
the first row to get the highest score, and the rows for paging in the
real application...

Anyway, my point is that the queries seem fine without the LIMIT clause,
and "hang" with both "~*" and LIMIT, and I've even gone so far as to
incorporate that into the application logic for now, just to have a page
that loads at all instead of one that hangs.

Meanwhile, I guess I should flail at it and try 7.3 in the hopes the bug
disappeared.  I was hoping to know for sure that it was a fixed bug in
that upgrade path.

Boss actually said we should go ahead and upgrade just on principle
anyway.  It's nice to have a smart boss. :-)




Re: ~* + LIMIT => infinite time?

From
"Josh Berkus"
Date:
Typea,

> At the moment, I've altered the application to not use LIMIT when I
> have
> ~* in the query, and it "works" -- only the paging of results is
> broken,

Would your application allow you to use " ILIKE '%<VALUE>%'" in the
query instead of "~*"  ?  If so, does the query still hang with ILIKE
... LIMIT?

-Josh

Re: ~* + LIMIT => infinite time?

From
Hannu Krosing
Date:
typea@l-i-e.com wrote on Sun, 15.12.2002 at 05:41:
> It occurred to me last night that the actual data *MIGHT* be involved --
> It's some OCR text, and there are a few scattered non-ASCII characters
> involved...  So *MAYBE* the actual text getting scanned could also be
> important.
>
> It seems unlikely, since the non-LIMIT query returns all the data just
> fine, but just in case...

Have you tried using DECLARE CURSOR...; FETCH 1; CLOSE CURSOR; instead
of LIMIT ?
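
I.e. something like this (a sketch, with the big SELECT abbreviated):

```sql
BEGIN;
DECLARE top_hit CURSOR FOR
    SELECT DISTINCT *, ... AS points    -- the full query, minus LIMIT
      FROM article
     WHERE ...
     ORDER BY points DESC, volume, number, article;
FETCH 1 FROM top_hit;   -- just the top row
CLOSE top_hit;
COMMIT;
```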

> Here's a schema and a full dump for anybody that wants to dig in:
> http://bulletinarchive.org/pg_dump/

gzipping the data could make sense - data.sql goes from 200M to 60M ;)

> I could provide PHP source as well, or the query posted in this thread can
> serve as the test case.
>
> At the moment, I've altered the application to not use LIMIT when I have
> ~* in the query, and it "works" -- only the paging of results is broken,
> and the whole page takes twice as long to load as it should in those
> cases, since it's doing the same query twice and snarfing all the monster
> data and then throwing away the majority of rows in both cases.  I need
> the first row to get the highest score, and the rows for paging in the
> real application...
>
> Anyway, my point is that the queries seem fine without the LIMIT clause,
> and "hang" with both "~*" and LIMIT, and I've even gone so far as to
> incorporate that into the application logic for now, just to have a page
> that loads at all instead of one that hangs.
>
> Meanwhile, I guess I should flail at it and try 7.3 in the hopes the bug
> disappeared.

I tested (part of) it on 7.3; I had to manually change ::int to
case-when-then-else-end, as there is no cast from bool to int in 7.3.

This ran fine:

SELECT DISTINCT
  *,
  0 + case when (title ilike '%albert einstein%') then 10 else 0 end
    + case when  (    title iLIKE '%einstein%'
            AND title iLIKE '%albert%'
            AND (    (title ~* 'einstein.{0,20}albert')
                  OR (title ~* 'albert.{0,20}einstein'))) then 8 else 0 end
  as points
 FROM article
WHERE FALSE
   OR (title iLIKE '%albert%')
   OR (author_flattened iLIKE '%albert%')
   OR (subject_flattened iLIKE '%albert%')
   OR (title iLIKE '%einstein%')
   OR (author_flattened iLIKE '%einstein%')
   OR (subject_flattened iLIKE '%einstein%')
ORDER BY points desc, volume, number, article.article
LIMIT 1 OFFSET 1;

I also changed
   "lower(field) like '%albert%'"
 to
   "field ilike '%albert%'"

and got about a 20% speed boost -- EXPLAIN ANALYZE reported 0.189 instead
of 0.263 sec as actual time.

>   I was hoping to know for sure that it was a fixed bug in
> that upgrade path.
>
> Boss actually said we should go ahead and upgrade just on principle
> anyway.  It's nice to have a smart boss. :-)
--
Hannu Krosing <hannu@tm.ee>

Re: ~* + LIMIT => infinite time?

From
Date:
> Have you tried using DECLARE CURSOR...; FETCH 1; CLOSE CURSOR; instead
> of LIMIT ?

I think I did, in the monitor, and it worked fine.

> I tested (part of) it on 7.3 , had to manually change ::int to
> case-when-then-else-end as there is no cast from bool to int in7.3

An upgrade to 7.3 has, in fact, gotten rid of that bug...

Though now I'm forced to use localhost for connecting, since:
A) Upon boot, I'm told I can't use password or crypt, but
B) When trying to connect, I can't use md5
C) the passwords get turned into md5 whether I like it or not
What's up with all that?

I also don't understand why the incredibly useful function I wrote to
auto-typecast from bool to int won't work using ::int syntax, but will if
I use the int4(...) syntax.  Grrr.

And breaking the LIMIT x, y thing was annoying.

Oh well.  I can move forward with some changes in the way we do things.

Now that the query runs, I can start in on the optimization again :-)

THANKS ALL!!!

Oh, and the lower(field) LIKE is MySQL compatible, but I don't think MySQL
has an ILIKE... We're abandoning the MySQL support now anyway, since we
NEED performance way more than we need MySQL compatibility.

Thanks again!




Re: ~* + LIMIT => infinite time?

From
Date:
>> I tested (part of) it on 7.3 , had to manually change ::int to
>> case-when-then-else-end as there is no cast from bool to int in7.3
>
> An upgrade to 7.3 has, in fact, gotten rid of that bug...

Damn.  I spoke too soon.  It *SEEMS* like it's back again.  Very, very
strange.

If explain claims the "cost" will be ~1000, and then a query takes SO long
to return I give up and hit ^C, that's just not right, right?  I mean,
that "cost" near 1000 may not be in seconds or anything, but 1000 is
pretty low, isn't it?

I give up for now.  Need sleep.





Re: ~* + LIMIT => infinite time?

From
"Josh Berkus"
Date:
Typea,

> Oh, and the lower(field) LIKE is MySQL compatible, but I don't think
> MySQL
> has an ILIKE... We're abandoning the MySQL support now anyway, since
> we
> NEED performance way more than we need MySQL compatibility.

ILIKE is SQL-spec.  There are reasons to use each:

ILIKE is slightly faster on un-anchored text searches ("name ILIKE
'%john%'")

lower(column) can be indexed for anchored text searches ("lower(name)
LIKE 'john%'")

"~*" cannot be indexed, but will accept regexp operators for
sophisticated text searches ("name ~* 'jo[han]n?'")
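
A sketch of that middle case (index name invented; assumes a locale where
LIKE can use indexes, e.g. C):

```sql
CREATE INDEX article_lower_title_idx ON article (lower(title));

-- Anchored: the planner can use the index
SELECT id FROM article WHERE lower(title) LIKE 'einstein%';

-- Un-anchored: still a sequential scan
SELECT id FROM article WHERE lower(title) LIKE '%einstein%';
```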

-Josh Berkus

Re: ~* + LIMIT => infinite time?

From
Tom Lane
Date:
"Josh Berkus" <josh@agliodbs.com> writes:
> ILIKE is SQL-spec.

It is?  I don't see it in there ...

            regards, tom lane