Thread: Slow self-join on a 100 million record table
I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts. There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:

TheID      TheWord
-----      -------
1          I
2          saw
3          the
4          man
5          that
6          came
. . .
100000000  xxx

To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc. are used to find preceding and following
words, e.g.:

select count(*), w1.w1, w2.w1, w3.w1 from
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
 inner join
 (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
 on w2.ID = w1.ID)
inner join
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID = w1.ID
group by w1.w1, w2.w1, w3.w1

This would yield results like "the man that" (words 3-5 above), "that
woman who", "this man which", etc.

The problem is, the self-join solution is extremely slow. I have a SQL
Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID. Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HDs w/ RAID-0).

Any suggestions? Have I messed up in terms of the SQL statement?
Thanks in advance for any help that you can give.

Mark Davies
Illinois State University

P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.
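For reference, a minimal sketch of the layout described above. The
query in the post refers to a table called seq with columns ID and w1
(the prose calls them TheID and TheWord), so those names are used here;
the varchar length and the index names are assumptions made for
illustration only.

-- Sketch of the schema as described in the post
CREATE TABLE seq (
    ID  int         NOT NULL,   -- TheID: offset of the word in the text
    w1  varchar(50) NOT NULL    -- TheWord: the word at that offset
)

-- "a clustered index on TheWord ... and a normal index on TheID"
CREATE CLUSTERED INDEX IX_seq_word ON seq (w1)
CREATE NONCLUSTERED INDEX IX_seq_id ON seq (ID)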
Mark,

I doubt very much you will ever get much faster results (without
increasing hardware) in a situation such as that. Your queries don't
look selective enough to use the indexes effectively. What is the
query plan for each of the individual selects, and what does it look
like as a whole? How many rows does each individual select return, and
how many does the final statement return?

--
Andrew J. Kelly   SQL Server MVP

"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
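One way to gather the information Andrew is asking for is sketched
below. The table and column names follow the query in the original
post, and only one of the three individual selects is shown; SET
SHOWPLAN_TEXT and the STATISTICS options are available on SQL Server
7.0.

-- Show the estimated plan as text for one of the individual selects
-- (SET SHOWPLAN_TEXT must be the only statement in its batch, hence
-- the GO separators).
SET SHOWPLAN_TEXT ON
GO
SELECT w1, ID + 1 AS ID FROM seq WHERE w1 IN ('the', 'that', 'this')
GO
SET SHOWPLAN_TEXT OFF
GO

-- Then run the same piece for real with I/O and timing statistics to
-- see how many rows it returns and how much work it does.
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT w1, ID + 1 AS ID FROM seq WHERE w1 IN ('the', 'that', 'this')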
mdavies@ilstu.edu (Mark Davies) writes:
> The problem is, the self-join solution is extremely slow. I have a
> SQL Server 7.0 database with a clustered index on TheWord (sequential
> words) and a normal index on TheID.

Kindly do not pester Postgres mailing lists with SQL Server questions.

            regards, tom lane
I've played with this quite a lot on a fairly similar machine to yours
-- i.e. dual Pentium III 700s, 5 x 4.5GB 10K SCSI disks in hardware
RAID-0, 1GB RAM, Windows 2000 Professional (SP3), SQL Server 2000
Desktop (SP2). As expected, my elapsed times are almost exactly twice
as long as yours.

I created my 100 million rows from an 8K file of words, which I copied
and appended to itself until it contained 100 million rows. Then I
DTSed it into my table:

CREATE TABLE test (
    wid  int IDENTITY (1, 1) NOT NULL,
    word varchar(20) NOT NULL
) ON [PRIMARY]

I went through numerous sequences, finally ending up with the data and
clustered index in the Primary filegroup on my RAID-0 disks, and the
nonclustered primary key (wid) in a secondary filegroup on a separate
FireWire disk.

I also simplified your SQL (I don't see any need for subqueries), as
follows:

SELECT Refs = COUNT(*), t1.word, t2.word, t3.word
FROM test t1
JOIN test t2
  ON t2.wid = t1.wid + 1
 AND t1.word IN ('any', 'of', 'a')
 AND t2.word IN ('corel', 'woman', 'person')
JOIN test t3
  ON t3.wid = t1.wid + 2
 AND t3.word IN ('ventura', 'that', 'which')
GROUP BY t1.word, t2.word, t3.word

This gives me:

Refs        word    word    word
----------- ------- ------- -------
23086       a       Corel   VENTURA
57715       of      Corel   VENTURA

All the permutations I tried gave an identical result -- it always
takes 30 seconds on my machine, whether I run the query above, your
version of the same query, or any number of other ideas I tried. Just
selecting the first word with no joins takes 5 seconds. I was
surprised the secondary filegroup didn't improve speed at all.

I feel like I've been on a long journey, and ended up exactly where I
started!!

--
Paul Thornett

"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
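For anyone wanting to reproduce the filegroup split Paul describes, a
rough sketch follows. The database name, file path, and the filegroup,
index, and constraint names are invented for illustration; only the
placement itself (data and clustered index in PRIMARY, nonclustered
primary key on wid in a secondary filegroup on a separate disk) comes
from the post.

-- Add a secondary filegroup backed by a file on a separate disk
-- (names and path are assumptions).
ALTER DATABASE wordtest ADD FILEGROUP SecondaryFG
ALTER DATABASE wordtest ADD FILE
    (NAME = wordtest_fg2, FILENAME = 'F:\data\wordtest_fg2.ndf')
    TO FILEGROUP SecondaryFG

-- Clustered index on the word column stays in the default PRIMARY
-- filegroup along with the data.
CREATE CLUSTERED INDEX IX_test_word ON test (word)

-- Nonclustered primary key on wid placed in the secondary filegroup.
ALTER TABLE test ADD CONSTRAINT PK_test
    PRIMARY KEY NONCLUSTERED (wid) ON SecondaryFG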