Thread: Slow self-join on a 100 million record table

Slow self-join on a 100 million record table

From
mdavies@ilstu.edu (Mark Davies)
Date:
I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts.  There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:

TheID  TheWord
----  -----

1  I 
2  saw 
3  the 
4  man 
5  that
6  came. . . 
100000000 xxx 

To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following
words, e.g.:

select count(*),w1.w1,w2.w1,w3.w1 from 
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
inner join 
(select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
on w2.ID = w1.ID) 
inner join 
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID=w1.ID 
group by w1.w1,w2.w1,w3.w1 

This would yield results like "the man that" (words 3-5 above),"that
woman who","this man which", etc.

The problem is, the self-join solution is extremely slow.  I have a
SQL Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID.  Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).

Any suggestions?  Have I messed up in terms of the SQL statement? 
Thanks in advance for any help that you can give.

Mark Davies
Illinois State University

P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.


Re: Slow self-join on a 100 million record table

From
"Andrew J. Kelly"
Date:
Mark,

I doubt very much you will ever get much faster results (without increasing
hardware) in a situation such as that.  Your queries don't look selective
enough to effectively use the indexes.  What is the query plan for each of
the individual selects and what does it look like as a whole?   How many
rows does each individual select return and how many for the final
statement?

-- 
Andrew J. Kelly
SQL Server MVP


"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
> I have a database containing 100 million records, in which each record
> contains (in sequence) all of the words in a 100 million word
> collection of texts.  There are two columns: TheID (offset value) and
> TheWord (sequential words), e.g.:
>
> TheID  TheWord
> ----  -----
>
> 1  I
> 2  saw
> 3  the
> 4  man
> 5  that
> 6  came
>  . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
>
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow.  I have a
> SQL Server 7.0 database with a clustered index on TheWord (sequential
> words) and a normal index on TheID.  Even with all of this, however, a
> self-join query like the one just listed takes about 15 seconds on my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions?  Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.
>
> Mark Davies
> Illinois State University
>
> P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
> adequate for my purposes -- there's a lot more to the project than
> what I've described here.




Re: Slow self-join on a 100 million record table

From
Tom Lane
Date:
mdavies@ilstu.edu (Mark Davies) writes:
> The problem is, the self-join solution is extremely slow.  I have a
> SQL Server 7.0 database with a clustered index on TheWord (sequential
> words) and a normal index on TheID.

Kindly do not pester Postgres mailing lists with SQL Server questions.
        regards, tom lane


Re: Slow self-join on a 100 million record table

From
"Paul Thornett"
Date:
I've played with this quite a lot on a fairly similar machine to
yours - i.e. Dual pentium III 700s, 5x4.5Gb 10k Scsi disks Hardware
Raid0, 1 Gb. Ram, Windows 2000 Professional (SP3), Sql Server 2000
Desktop (SP2). As expected my elapse times are almost exactly twice as
long as yours.

I created my 100 million rows using an 8k file of words which I then
copied and appended until it contained 100 million rows. Then I DTssed
it into my table:
CREATE TABLE test (wid  int         IDENTITY (1, 1) NOT NULL, word varchar(20) NOT NULL) ON [PRIMARY]

I went through numerous sequences, finally ending up with the data and
clustered index in the Primary Filegroup on my Raid0 disks, and the
nonclustered primary key (wid) in a secondary Filegroup on a separate
Firewire disk.

I also simplified your Sql (I don't see any need for subqueries), as
follows:

SELECT Refs=COUNT(*), t1.word, t2.word, t3.word
FROM test t1
JOIN test t2
ON t2.wid=t1.wid+1
AND t1.word in ('any','of','a')
AND t2.word in ('corel','woman','person')
JOIN test t3
ON t3.wid=t1.wid+2
AND t3.word in ('ventura','that','which')
GROUP BY t1.word, t2.word, t3.word

This gives me:

Refs       word   word      word
---------------------------------
23086   a          Corel     VENTURA
57715   of         Corel     VENTURA

All the permutations I tried gave an identical result - it always
takes 30 seconds to run the above query on my machine, or your version
of the same query, or any number of other ideas I tried. Just to
select the first word with no joins takes 5 seconds. I was surprised
the secondary Filegroup didn't improve speed at all.

I feel like I've been on a long journey, and ended up exactly where I
started!!

-- 
Paul Thornett


"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
> I have a database containing 100 million records, in which each
record
> contains (in sequence) all of the words in a 100 million word
> collection of texts.  There are two columns: TheID (offset value)
and
> TheWord (sequential words), e.g.:
>
> TheID  TheWord
> ----  -----
>
> 1  I
> 2  saw
> 3  the
> 4  man
> 5  that
> 6  came
>  . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in
which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this'))
w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person'))
w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which'))
w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow.  I have a
> SQL Server 7.0 database with a clustered index on TheWord
(sequential
> words) and a normal index on TheID.  Even with all of this, however,
a
> self-join query like the one just listed takes about 15 seconds on
my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions?  Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.