Re: Simple machine-killing query! - Mailing list pgsql-performance

From Aaron Werman
Subject Re: Simple machine-killing query!
Msg-id 157f648404102108141836a5ac@mail.gmail.com
In response to Simple machine-killing query!  (Victor Ciurus <vikcious@gmail.com>)
List pgsql-performance
Sounds like you need some way to match a subset of the data first,
rather than trying indices that are bigger than the data themselves. Can
you add operational indices, perhaps on the first 10 bytes of the keys
in both tables, or on an integer hash of each string? If so, you could
do the cheap match on the operational index first, and then compute the
exact set difference over only the rows that survive that match.
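A minimal sketch of that idea, assuming the BIGMA/DIRTY schema quoted
below (the index names here are made up, and this is untested):

    -- Expression index on a short prefix of the key, in both tables:
    CREATE INDEX bigma_prefix_idx ON "BIGMA" (substr("string", 1, 10));
    CREATE INDEX dirty_prefix_idx ON "DIRTY" (substr("string", 1, 10));

    -- Narrow on the cheap prefix match first, then confirm the exact match:
    SELECT b.*
    FROM "BIGMA" b
    WHERE NOT EXISTS (
        SELECT 1
        FROM "DIRTY" d
        WHERE substr(d."string", 1, 10) = substr(b."string", 1, 10)
          AND d."string" = b."string"
    );

The prefix comparison lets the planner use the small expression index to
throw away most candidate rows before doing the full-string comparison.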

/Aaron


On Thu, 21 Oct 2004 17:34:17 +0300, Victor Ciurus <vikcious@gmail.com> wrote:
> Hi all,
>
> I'm writing this because I've reached the limit of my imagination and
> patience! So here is it...
>
> 2 tables:
> 1 containing 27 million variable-length, alpha-numeric records
> (strings) in 1 (one) field. (10 - 145 char length per record)
> 1 containing 2.5 million variable-length, alpha-numeric records
> (strings) in 1 (one) field.
>
> tables were created using:
> CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
> CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
> and
> CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
> CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");
>
> What I am requested to do is to keep all records from 'BIGMA' that do
> not apear in 'DIRTY'
> So far I have tried solving this by going for:
>
> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)
>
> AND
>
> [explain] select * from bigma,dirty where bigma.email!=dirty.email;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Nested Loop  (cost=20.00..56382092.13 rows=2491443185 width=227)
>    Join Filter: (("inner".email)::text <> ("outer".email)::text)
>    ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
>    ->  Materialize  (cost=20.00..30.00 rows=1000 width=145)
>          ->  Seq Scan on bigma  (cost=0.00..20.00 rows=1000 width=145)
> (5 rows)
>
> Now the problem is that both of my previous tries seem to last
> forever! I'm not a pgsql guru, so that's why I'm asking you fellas to
> guide me right! I've tried this on MySQL previously, but there seems to
> be no way MySQL can handle this large a query.
>
> QUESTIONS:
> What can I do in order to make this work?
> Where do I make mistakes? Is there a way I can improve the performance
> in table design, query style, server setting so that I can get this
> monster going and producing a result?
>
> Thanks all for your precious time and answers!
>
> Victor C.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
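For reference, NOT IN with a large uncorrelated subquery is a known
slow spot in PostgreSQL of this vintage; the usual rewrites are an
anti-join, sketched here with the table/column names from the quoted
schema (untested):

    -- 1. NOT EXISTS form:
    SELECT b.*
    FROM "BIGMA" b
    WHERE NOT EXISTS (
        SELECT 1 FROM "DIRTY" d WHERE d."string" = b."string"
    );

    -- 2. Outer-join form: keep BIGMA rows that found no DIRTY partner.
    SELECT b.*
    FROM "BIGMA" b
    LEFT JOIN "DIRTY" d ON d."string" = b."string"
    WHERE d."string" IS NULL;

Raising sort_mem enough that the planner can hash the 2.5M-row DIRTY
set in memory should also help whichever form you use.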


--

Regards,
/Aaron
