Slow self-join on a 100 million record table - Mailing list pgsql-sql

From mdavies@ilstu.edu (Mark Davies)
Subject Slow self-join on a 100 million record table
Date
Msg-id a28f4bc4.0212310746.4a272b00@posting.google.com
Whole thread Raw
Responses Re: Slow self-join on a 100 million record table  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-sql
I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts.  There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:

TheID  TheWord
----  -----

1  I 
2  saw 
3  the 
4  man 
5  that
6  came. . . 
100000000 xxx 

To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following
words, e.g.:

select count(*),w1.w1,w2.w1,w3.w1 from 
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
inner join 
(select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
on w2.ID = w1.ID) 
inner join 
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID=w1.ID 
group by w1.w1,w2.w1,w3.w1 

This would yield results like "the man that" (words 3-5 above),"that
woman who","this man which", etc.

The problem is, the self-join solution is extremely slow.  I have a
SQL Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID.  Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).

Any suggestions?  Have I messed up in terms of the SQL statement? 
Thanks in advance for any help that you can give.

Mark Davies
Illinois State University

P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.


pgsql-sql by date:

Previous
From:
Date:
Subject: empty arrays
Next
From: "Andrew J. Kelly"
Date:
Subject: Re: Slow self-join on a 100 million record table