Home > mailing lists

Re: Slow self-join on a 100 million record table - Mailing list pgsql-sql

From	Andrew J. Kelly
Subject	Re: Slow self-join on a 100 million record table
Date	January 2, 2003 00:14:48
Msg-id	e9kyJZOsCHA.1808@TK2MSFTNGP09 Whole thread
In response to	Slow self-join on a 100 million record table (mdavies@ilstu.edu (Mark Davies))
List	pgsql-sql

Tree view

Mark,

I doubt very much you will ever get much faster results (without increasing
hardware) in a situation such as that.  Your queries don't look selective
enough to effectively use the indexes.  What is the query plan for each of
the individual selects and what does it look like as a whole?   How many
rows does each individual select return and how many for the final
statement?

-- 
Andrew J. Kelly
SQL Server MVP


"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
> I have a database containing 100 million records, in which each record
> contains (in sequence) all of the words in a 100 million word
> collection of texts.  There are two columns: TheID (offset value) and
> TheWord (sequential words), e.g.:
>
> TheID  TheWord
> ----  -----
>
> 1  I
> 2  saw
> 3  the
> 4  man
> 5  that
> 6  came
>  . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
>
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow.  I have a
> SQL Server 7.0 database with a clustered index on TheWord (sequential
> words) and a normal index on TheID.  Even with all of this, however, a
> self-join query like the one just listed takes about 15 seconds on my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions?  Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.
>
> Mark Davies
> Illinois State University
>
> P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
> adequate for my purposes -- there's a lot more to the project than
> what I've described here.

pgsql-sql by date:

From: mdavies@ilstu.edu (Mark Davies)
Date: 02 January 2003, 00:14:32
Subject: Slow self-join on a 100 million record table

From: Minghann Ho
Date: 02 January 2003, 00:15:01
Subject: bug?

Re: Slow self-join on a 100 million record table - Mailing list pgsql-sql

Previous

Next