Re: Slow self-join on a 100 million record table - Mailing list pgsql-sql

From Paul Thornett
Subject Re: Slow self-join on a 100 million record table
Date
Msg-id ONtn28gsCHA.440@TK2MSFTNGP12
Whole thread Raw
In response to Slow self-join on a 100 million record table  (mdavies@ilstu.edu (Mark Davies))
List pgsql-sql
I've played with this quite a lot on a fairly similar machine to
yours - i.e. Dual pentium III 700s, 5x4.5Gb 10k Scsi disks Hardware
Raid0, 1 Gb. Ram, Windows 2000 Professional (SP3), Sql Server 2000
Desktop (SP2). As expected my elapse times are almost exactly twice as
long as yours.

I created my 100 million rows using an 8k file of words which I then
copied and appended until it contained 100 million rows. Then I DTssed
it into my table:
CREATE TABLE test (wid  int         IDENTITY (1, 1) NOT NULL, word varchar(20) NOT NULL) ON [PRIMARY]

I went through numerous sequences, finally ending up with the data and
clustered index in the Primary Filegroup on my Raid0 disks, and the
nonclustered primary key (wid) in a secondary Filegroup on a separate
Firewire disk.

I also simplified your Sql (I don't see any need for subqueries), as
follows:

SELECT Refs=COUNT(*), t1.word, t2.word, t3.word
FROM test t1
JOIN test t2
ON t2.wid=t1.wid+1
AND t1.word in ('any','of','a')
AND t2.word in ('corel','woman','person')
JOIN test t3
ON t3.wid=t1.wid+2
AND t3.word in ('ventura','that','which')
GROUP BY t1.word, t2.word, t3.word

This gives me:

Refs       word   word      word
---------------------------------
23086   a          Corel     VENTURA
57715   of         Corel     VENTURA

All the permutations I tried gave an identical result - it always
takes 30 seconds to run the above query on my machine, or your version
of the same query, or any number of other ideas I tried. Just to
select the first word with no joins takes 5 seconds. I was surprised
the secondary Filegroup didn't improve speed at all.

I feel like I've been on a long journey, and ended up exactly where I
started!!

-- 
Paul Thornett


"Mark Davies" <mdavies@ilstu.edu> wrote in message
news:a28f4bc4.0212310746.4a272b00@posting.google.com...
> I have a database containing 100 million records, in which each
record
> contains (in sequence) all of the words in a 100 million word
> collection of texts.  There are two columns: TheID (offset value)
and
> TheWord (sequential words), e.g.:
>
> TheID  TheWord
> ----  -----
>
> 1  I
> 2  saw
> 3  the
> 4  man
> 5  that
> 6  came
>  . . .
> 100000000 xxx
>
> To extract strings, I then use self-joins on this one table, in
which
> [ID], [ID-1], [ID+1] etc are used to find preceding and following
> words, e.g.:
> select count(*),w1.w1,w2.w1,w3.w1 from
> ((select w1, ID+1 as ID from seq where w1 in ('the','that','this'))
w1
> inner join
> (select w1, ID as ID from seq where w1 in ('man','woman','person'))
w2
> on w2.ID = w1.ID)
> inner join
> (select w1, ID-1 as ID from seq where w1 in ('who','that','which'))
w3
> on w3.ID=w1.ID
> group by w1.w1,w2.w1,w3.w1
>
> This would yield results like "the man that" (words 3-5 above),"that
> woman who","this man which", etc.
>
> The problem is, the self-join solution is extremely slow.  I have a
> SQL Server 7.0 database with a clustered index on TheWord
(sequential
> words) and a normal index on TheID.  Even with all of this, however,
a
> self-join query like the one just listed takes about 15 seconds on
my
> machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
>
> Any suggestions?  Have I messed up in terms of the SQL statement?
> Thanks in advance for any help that you can give.




pgsql-sql by date:

Previous
From: "Pedro Igor"
Date:
Subject: PostgreSQL X Resin EE
Next
From: Steve Crawford
Date:
Subject: Re: empty arrays