Re: Slow self-join on a 100 million record table - Mailing list pgsql-sql
From | Paul Thornett |
---|---|
Subject | Re: Slow self-join on a 100 million record table |
Date | |
Msg-id | ONtn28gsCHA.440@TK2MSFTNGP12 Whole thread Raw |
In response to | Slow self-join on a 100 million record table (mdavies@ilstu.edu (Mark Davies)) |
List | pgsql-sql |
I've played with this quite a lot on a fairly similar machine to yours - i.e. Dual pentium III 700s, 5x4.5Gb 10k Scsi disks Hardware Raid0, 1 Gb. Ram, Windows 2000 Professional (SP3), Sql Server 2000 Desktop (SP2). As expected my elapse times are almost exactly twice as long as yours. I created my 100 million rows using an 8k file of words which I then copied and appended until it contained 100 million rows. Then I DTssed it into my table: CREATE TABLE test (wid int IDENTITY (1, 1) NOT NULL, word varchar(20) NOT NULL) ON [PRIMARY] I went through numerous sequences, finally ending up with the data and clustered index in the Primary Filegroup on my Raid0 disks, and the nonclustered primary key (wid) in a secondary Filegroup on a separate Firewire disk. I also simplified your Sql (I don't see any need for subqueries), as follows: SELECT Refs=COUNT(*), t1.word, t2.word, t3.word FROM test t1 JOIN test t2 ON t2.wid=t1.wid+1 AND t1.word in ('any','of','a') AND t2.word in ('corel','woman','person') JOIN test t3 ON t3.wid=t1.wid+2 AND t3.word in ('ventura','that','which') GROUP BY t1.word, t2.word, t3.word This gives me: Refs word word word --------------------------------- 23086 a Corel VENTURA 57715 of Corel VENTURA All the permutations I tried gave an identical result - it always takes 30 seconds to run the above query on my machine, or your version of the same query, or any number of other ideas I tried. Just to select the first word with no joins takes 5 seconds. I was surprised the secondary Filegroup didn't improve speed at all. I feel like I've been on a long journey, and ended up exactly where I started!! -- Paul Thornett "Mark Davies" <mdavies@ilstu.edu> wrote in message news:a28f4bc4.0212310746.4a272b00@posting.google.com... > I have a database containing 100 million records, in which each record > contains (in sequence) all of the words in a 100 million word > collection of texts. There are two columns: TheID (offset value) and > TheWord (sequential words), e.g.: > > TheID TheWord > ---- ----- > > 1 I > 2 saw > 3 the > 4 man > 5 that > 6 came > . . . > 100000000 xxx > > To extract strings, I then use self-joins on this one table, in which > [ID], [ID-1], [ID+1] etc are used to find preceding and following > words, e.g.: > select count(*),w1.w1,w2.w1,w3.w1 from > ((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1 > inner join > (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2 > on w2.ID = w1.ID) > inner join > (select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3 > on w3.ID=w1.ID > group by w1.w1,w2.w1,w3.w1 > > This would yield results like "the man that" (words 3-5 above),"that > woman who","this man which", etc. > > The problem is, the self-join solution is extremely slow. I have a > SQL Server 7.0 database with a clustered index on TheWord (sequential > words) and a normal index on TheID. Even with all of this, however, a > self-join query like the one just listed takes about 15 seconds on my > machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0). > > Any suggestions? Have I messed up in terms of the SQL statement? > Thanks in advance for any help that you can give.