Re: Simple machine-killing query! - Mailing list pgsql-performance

From Stephan Szabo
Subject Re: Simple machine-killing query!
Date
Msg-id 20041021075757.A54616@megazone.bigpanda.com
Whole thread Raw
In response to Simple machine-killing query!  (Victor Ciurus <vikcious@gmail.com>)
List pgsql-performance
On Thu, 21 Oct 2004, Victor Ciurus wrote:

> Hi all,
>
> I'm writing this because I've reached the limit of my imagination and
> patience! So here is it...
>
> 2 tables:
> 1 containing 27 million variable lenght, alpha-numeric records
> (strings) in 1 (one) field. (10 - 145 char lenght per record)
> 1 containing 2.5 million variable lenght, alpha-numeric records
> (strings) in 1 (one) field.
>
> table wehere created using:
> CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
> CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
> and
> CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
> CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");
>
> What I am requested to do is to keep all records from 'BIGMA' that do
> not apear in 'DIRTY'
> So far I have tried solving this by going for:
>
> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)

Have you analyzed bigma? The number of rows from the two explains for that
table look suspiciously like default values.

Also, what version are you using, because there are some differences from
7.3 to 7.4 that change possible suggestions.

The first is that on 7.4, you may be able to do better with a higher
sort_mem which could possible switch over to the hashed implementation,
although I think it's probably going to take a pretty high value given the
size.

The second is that you might get better results (even on older versions)
from an exists or left join solution, something like (assuming no nulls in
bigma.email):

select * from bigma where not exists(select 1 from dirty where dirty.email
!= bigma.email);

select bigma.* from bigma left outer join dirty on (dirty.email =
bigma.email) where dirty.email is null;

If you've got nulls in bigma.email you have to be a little more careful.

> [explain] select * from bigma,dirty where bigma.email!=dirty.email;

This *almost* certainly does not do what you want.  For most data sets
this is going to give you a number of rows very close to # of rows in
dirty * # of rows in bigma.  Needless to say, this is going to take a long
time.

pgsql-performance by date:

Previous
From: Jan Dittmer
Date:
Subject: Re: Anything to be gained from a 'Postgres Filesystem'?
Next
From: Aaron Werman
Date:
Subject: Re: Simple machine-killing query!