Thread: Simple machine-killing query!

Simple machine-killing query!

From
Victor Ciurus
Date:
Hi all,

I'm writing this because I've reached the limit of my imagination and
patience! So here it is...

2 tables:
1 containing 27 million variable-length, alphanumeric records
(strings) in 1 (one) field (10-145 chars per record).
1 containing 2.5 million variable-length, alphanumeric records
(strings) in 1 (one) field.

Tables were created using:
CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
and
CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");

What I am requested to do is to keep all records from 'BIGMA' that do
not appear in 'DIRTY'.
So far I have tried solving this by going for:

[explain] select * from BIGMA where string not in (select * from DIRTY);
                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
(4 rows)

AND

[explain] select * from bigma,dirty where bigma.email!=dirty.email;
                              QUERY PLAN
-----------------------------------------------------------------------
 Nested Loop  (cost=20.00..56382092.13 rows=2491443185 width=227)
   Join Filter: (("inner".email)::text <> ("outer".email)::text)
   ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=145)
         ->  Seq Scan on bigma  (cost=0.00..20.00 rows=1000 width=145)
(5 rows)

Now the problem is that both of my previous tries seem to last
forever! I'm not a pgsql guru, so that's why I'm asking you fellas to
guide me right! I've tried this on MySQL previously, but there seems to
be no way MySQL can handle this large a query.

QUESTIONS:
What can I do in order to make this work?
Where do I make mistakes? Is there a way I can improve the performance
in table design, query style, server setting so that I can get this
monster going and producing a result?

Thanks all for your precious time and answers!

Victor C.

Re: Simple machine-killing query!

From
Stephan Szabo
Date:
On Thu, 21 Oct 2004, Victor Ciurus wrote:

> Hi all,
>
> I'm writing this because I've reached the limit of my imagination and
> patience! So here it is...
>
> 2 tables:
> 1 containing 27 million variable-length, alphanumeric records
> (strings) in 1 (one) field (10-145 chars per record).
> 1 containing 2.5 million variable-length, alphanumeric records
> (strings) in 1 (one) field.
>
> Tables were created using:
> CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
> CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
> and
> CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
> CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");
>
> What I am requested to do is to keep all records from 'BIGMA' that do
> not appear in 'DIRTY'.
> So far I have tried solving this by going for:
>
> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)

Have you analyzed bigma? The row estimates in the two explains for that
table look suspiciously like default values.

Also, what version are you using, because there are some differences from
7.3 to 7.4 that change possible suggestions.

The first is that on 7.4, you may be able to do better with a higher
sort_mem, which could possibly switch over to the hashed implementation,
although I think it's probably going to take a pretty high value given the
size.

The second is that you might get better results (even on older versions)
from an exists or left join solution, something like (assuming no nulls in
bigma.email):

select * from bigma where not exists(select 1 from dirty where dirty.email
= bigma.email);

select bigma.* from bigma left outer join dirty on (dirty.email =
bigma.email) where dirty.email is null;

If you've got nulls in bigma.email you have to be a little more careful.
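To make the NULL caveat concrete: if dirty.email contains even one NULL, NOT IN returns no rows at all, because `x <> NULL` evaluates to unknown, so the NOT IN predicate can never become true. A sketch of the two behaviors (column names as used in the thread):

```sql
-- Returns zero rows as soon as dirty.email holds a single NULL:
-- every comparison against that NULL is unknown, poisoning NOT IN.
SELECT * FROM bigma WHERE email NOT IN (SELECT email FROM dirty);

-- NOT EXISTS compares row by row, so a NULL in dirty simply fails
-- the equality test and non-matching bigma rows still come back:
SELECT * FROM bigma b
WHERE NOT EXISTS (SELECT 1 FROM dirty d WHERE d.email = b.email);
```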

> [explain] select * from bigma,dirty where bigma.email!=dirty.email;

This *almost* certainly does not do what you want.  For most data sets
this is going to give you a number of rows very close to # of rows in
dirty * # of rows in bigma.  Needless to say, this is going to take a long
time.

Re: Simple machine-killing query!

From
Aaron Werman
Date:
Sounds like you need some way to match a subset of the data first,
rather than try indices that are bigger than the data. Can you add
operation indices, perhaps on the first 10 bytes of the keys in both
tables or on an integer hash of all of the strings? If so, you could
compute the exact set difference over only the rows that already match
on that cheap operation.
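One way the idea above might look in PostgreSQL 7.4, which supports indexes on expressions (a sketch with hypothetical index names; table and column names taken from the CREATE statements in the thread):

```sql
-- Hypothetical prefix indexes: index only the first 10 characters,
-- so index entries stay small even for 145-character strings.
CREATE INDEX bigma_prefix_idx ON "BIGMA" (substr("string", 1, 10));
CREATE INDEX dirty_prefix_idx ON "DIRTY" (substr("string", 1, 10));

-- Anti-join that matches on the cheap prefix first and re-checks
-- full equality only where the prefixes collide:
SELECT b.*
FROM "BIGMA" b
LEFT JOIN "DIRTY" d
  ON substr(d."string", 1, 10) = substr(b."string", 1, 10)
 AND d."string" = b."string"
WHERE d."string" IS NULL;
```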

/Aaron


On Thu, 21 Oct 2004 17:34:17 +0300, Victor Ciurus <vikcious@gmail.com> wrote:
> Hi all,
>
> I'm writing this because I've reached the limit of my imagination and
> patience! So here it is...
>
> 2 tables:
> 1 containing 27 million variable-length, alphanumeric records
> (strings) in 1 (one) field (10-145 chars per record).
> 1 containing 2.5 million variable-length, alphanumeric records
> (strings) in 1 (one) field.
>
> Tables were created using:
> CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
> CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
> and
> CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
> CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");
>
> What I am requested to do is to keep all records from 'BIGMA' that do
> not appear in 'DIRTY'.
> So far I have tried solving this by going for:
>
> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)
>
> AND
>
> [explain] select * from bigma,dirty where bigma.email!=dirty.email;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Nested Loop  (cost=20.00..56382092.13 rows=2491443185 width=227)
>    Join Filter: (("inner".email)::text <> ("outer".email)::text)
>    ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
>    ->  Materialize  (cost=20.00..30.00 rows=1000 width=145)
>          ->  Seq Scan on bigma  (cost=0.00..20.00 rows=1000 width=145)
> (5 rows)
>
> Now the problem is that both of my previous tries seem to last
> forever! I'm not a pgsql guru, so that's why I'm asking you fellas to
> guide me right! I've tried this on MySQL previously, but there seems to
> be no way MySQL can handle this large a query.
>
> QUESTIONS:
> What can I do in order to make this work?
> Where do I make mistakes? Is there a way I can improve the performance
> in table design, query style, server setting so that I can get this
> monster going and producing a result?
>
> Thanks all for your precious time and answers!
>
> Victor C.
>


--

Regards,
/Aaron

Re: Simple machine-killing query!

From
Tom Lane
Date:
Victor Ciurus <vikcious@gmail.com> writes:
> What I am requested to do is to keep all records from 'BIGMA' that do
> not apear in 'DIRTY'
> So far I have tried solving this by going for:

> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)

If you are using PG 7.4, you can get reasonable performance out of this
approach, but you need to jack sort_mem up to the point where the whole
DIRTY table will fit into sort_mem (so that you get a hashed-subplan
plan and not a plain subplan).  If you find yourself setting sort_mem to
more than say half of your machine's available RAM, you should probably
forget that idea.
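For reference, a sketch of what that might look like in a 7.4 session (the value below is an illustrative guess, not a measured figure; sort_mem is per-sort and measured in kilobytes):

```sql
-- Session-local bump: DIRTY is ~2.5M rows at ~80 bytes apiece,
-- roughly 200 MB of raw data, plus hash-table overhead on top.
SET sort_mem = 400000;  -- 400,000 kB, i.e. ~400 MB

-- With sort_mem large enough, the planner can use a hashed subplan:
SELECT * FROM bigma WHERE string NOT IN (SELECT string FROM dirty);

-- Restore afterwards so other sorts don't claim that much memory.
SET sort_mem = DEFAULT;
```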

> [explain] select * from bigma,dirty where bigma.email!=dirty.email;

This of course does not give the right answer at all.

A trick that people sometimes use is an outer join:

select * from bigma left join dirty on (bigma.email=dirty.email)
where dirty.email is null;

Understanding why this works is left as an exercise for the reader
... but it does work, and pretty well too.  If you're using pre-7.4
PG then this is about the only effective solution AFAIR.

            regards, tom lane

Re: Simple machine-killing query!

From
Victor Ciurus
Date:
Well guys,

Your replies have been more than helpful to me, showing me both the
learning I still have to do about real SQL and the wonder called
PostgreSQL, plus a very good solution from Tom Lane (thanks a lot, sir!).

Indeed, changing sort_mem and other server parameters, along with the
quite strange (to me!) outer join Tom mentioned, finally let me finish
the cleaning task, and indeed at warp speed (some 5 minutes or less!).
I am running PG v7.4.5 on a P4 Celeron 1.7GHz with 1.5GB RAM, so as far
as time performance goes I might say I'm more than pleased with the
result. Given the amazement PG has "caused" me through its reliability
so far, I have decided to go even deeper in learning it!

Thanks again all for your precious help!

Regards,
Victor


>
> If you are using PG 7.4, you can get reasonable performance out of this
> approach, but you need to jack sort_mem up to the point where the whole
> DIRTY table will fit into sort_mem (so that you get a hashed-subplan
> plan and not a plain subplan).  If you find yourself setting sort_mem to
> more than say half of your machine's available RAM, you should probably
> forget that idea.
>
> > [explain] select * from bigma,dirty where bigma.email!=dirty.email;
>
> This of course does not give the right answer at all.
>
> A trick that people sometimes use is an outer join:
>
> select * from bigma left join dirty on (bigma.email=dirty.email)
> where dirty.email is null;
>
> Understanding why this works is left as an exercise for the reader
> ... but it does work, and pretty well too.  If you're using pre-7.4
> PG then this is about the only effective solution AFAIR.
>
>                         regards, tom lane
>

Re: Simple machine-killing query!

From
Josh Berkus
Date:
Victor,

> [explain] select * from BIGMA where string not in (select * from DIRTY);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Seq Scan on bigma  (cost=0.00..24582291.25 rows=500 width=145)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on dirty  (cost=0.00..42904.63 rows=2503963 width=82)

This is what you call an "evil query".   I'm not surprised it takes forever;
you're telling the database "Compare every value in 27 million rows of text
against 2.5 million rows of text and give me those that don't match."   There
is simply no way, on ANY RDBMS, for this query to execute and not eat all of
your RAM and CPU for a long time.

You're simply going to have to allocate shared_buffers and sort_mem (about 2GB
of sort_mem would be good) to the query, and turn the computer over to the
task until it's done.   And, for the sake of sanity, when you find the
200,000 rows that don't match, flag them so that you don't have to do this
again.
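A sketch of that flagging idea (hypothetical column name; one approach among several):

```sql
-- Hypothetical one-time flag so the expensive comparison is not redone:
ALTER TABLE bigma ADD COLUMN clean boolean;

-- EXISTS is an ordinary boolean expression, so it can populate the flag:
UPDATE bigma
   SET clean = NOT EXISTS
       (SELECT 1 FROM dirty d WHERE d.string = bigma.string);

-- Subsequent runs read the flag instead of re-scanning dirty:
SELECT * FROM bigma WHERE clean;
```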

--
Josh Berkus
Aglio Database Solutions
San Francisco