Re: indexing words slow - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: indexing words slow
Date
Msg-id 199803140627.BAA03989@candle.pha.pa.us
Whole thread Raw
In response to indexing words slow  (Bruce Momjian <maillist@candle.pha.pa.us>)
Responses Re: [HACKERS] Re: indexing words slow
List pgsql-hackers
> OK, I have the artist_fti table with 4.5 million rows, with an index
> artist_fti_idx on artist_fti.string.  I don't have a 'product' table
> because I don't have the disk space, but that is not really a problem
> for testing.
>
> I used the product table to populate the artits_fti, then deleted the
> table so I could create the artist_fti_idx index.  Single table, no
> join.
>
> I am running on BSD/OS 3.1 PP200 with SCSI Ultra Barracuda disks.
>
> I am seeing the same thing you are.  'lling' and 'tones' take 11-22
> seconds on the first few runs, then return almost immediately.  If I do
> another query and come back to the first one, I have the same slowness,
> with the disk light being on almost constantly.  Certain queries seem to
> lend themselves to speeding up, while 'rol' never seems to get really
> fast.
>
> I have to conclude that because of the way this table is created by
> slicing words, its layout is almost random.  The table is 272MB, and
> with 8k/page, that is 34,000 pages.  If we are going for 2,500 rows,
> odds are that each row is in a separate page.  So, to do the query, we
> are retrieving 2,500 8k pages, or 20MB of random data on the disk.  How
> long does it take to issue 2,500 disk requests that are scattered all
> over the disk.  Probably 11-22 seconds.
>
> My OS only lets me have 400 8k buffers, or 3.2MB of buffer.  As we are
> transferring 2,500 8k pages or 20MB of data, it is no surprise that the
> buffer cache doesn't help much.  Sometime, the data is grouped together
> on the disk, and that is why some are fast, but often they are not, and
> certainly in a situation where you are looking for two words to appear
> on the same row, they certainly are not on adjacent pages.
>
> Just started running CLUSTER, and it appears to be taking forever.  Does
> not seem to be using the disk, but a lot of CPU.  It appears to be
> caught in an endless loop.

OK, I have an answer for you.

First, the CLUSTER problem I was having was some fluke, probably
something else that got stuck somewhere.  Not a bug.

Second, CLUSTER takes forever because it is moving all over the disk
retrieving each row in order.

Third, I should have been able to solve this for you sooner.  The issues
of slowness you are seeing are the exact issues I dealt with five years
ago when I designed this fragmentation system.  I was so excited that
you could develop triggers to slice the words, I had forgotten the other
issues.

OK, it is.  As part of this fragmentation job, EVERY NIGHT, I re-slice
the user strings and dump them into a flat file.  I then load them into
Ingres as a heap table, then modify the table to ISAM.

You may say, why every night.  Well, the problem is I load the data at
100% fill-factor, and ISAM rapidly becomes fragmented with
insertions/deletions.

Btree does not become fragmented, but the problem there is that btree
use was very slow.  I believe this is because ISAM is basically a SORTED
HEAP with an index, so everything is close and packed tight.  Btree
doesn't seem to do that as well.  It is more spread out.

Users complain about the nightly re-index, but I tell them, 'Hey, you
are searching for fragments of words in 200k entries in 6 seconds.  I can
design it without the night job, but your searches all day will take
much longer."  That stops the conversation.

So, to your solution.  CLUSTER is too slow.  The disk is going crazy
moving single rows into the temp table.  I recommend doing a COPY of
artist_fti to a flat file, doing a Unix 'sort' on the flat file, then
re-loading the data into the artist_fti, and then putting the index on
the table and vacuum.

I have done this, and now all searches are instantaneous THE FIRST TIME
and every time.

With this change, I am anxious to hear how fast you can now do your
multi-word searches.  Daily changes will not really impact performance
because they are a small part of the total search, but this process of
COPY/sort/reload/reindex will need to be done on a regular basis to
maintain performance.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: disk space
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] casting & type comments