Thread: best way to query

best way to query

From

Steve Clark

Date:

25 January 2008, 09:04:08

Hello List,

this is a noobie question:

I have had to take over an existing system - it was supposed to have
some contraints that
prevented dangling references - but...

any way i have 2 table - A and B.
each table has a key field and if a row is in B it should have a
corresponding row in A - but theres
the problem it doesn't for all the rows in B.

So I want to do something like
delete from B where key not in (select key from A order by key);

The problem is there are about 1,000,000 rows in A and 300,000 rows in
B. I let the above run
all night and it was still running the next morning. Does anyone have
an idea of a better way.

B = t_event_ack_log
A = t_unit_event_log

explain shows:

  Aggregate  (cost=4712921585.30..4712921585.31 rows=1 width=0)
    ->  Seq Scan on t_event_ack_log a  (cost=103170.29..4712920878.60
rows=282677 width=0)
          Filter: (NOT (subplan))
          SubPlan
            ->  Materialize  (cost=103170.29..117301.92 rows=1016163
width=4)
                  ->  Index Scan using pk_tuel_eln on t_unit_event_log
  (cost=0.00..98184.12 rows=1016163 width=4)

OBTW: how do I interpret the cost - the manual says:
planner's guess at how long it will take to run the statement
(measured in units of disk page fetches)"

Not sure I understand (measured in units of disk page fetches)


Thanks,
Steve

Re: best way to query

From

Tom Lane

Date:

25 January 2008, 11:48:08

Steve Clark <sclark@netwolves.com> writes:
> explain shows:

>   Aggregate  (cost=4712921585.30..4712921585.31 rows=1 width=0)
>     ->  Seq Scan on t_event_ack_log a  (cost=103170.29..4712920878.60
> rows=282677 width=0)
>           Filter: (NOT (subplan))
>           SubPlan
>             ->  Materialize  (cost=103170.29..117301.92 rows=1016163
> width=4)
>                   ->  Index Scan using pk_tuel_eln on t_unit_event_log
>   (cost=0.00..98184.12 rows=1016163 width=4)

Yeah, that's going to suck.  A brute force solution is to see if you
can get it to switch to a "hashed subplan" by increasing work_mem.

Also, whatever is the ORDER BY for?

            regards, tom lane

Re: best way to query

From

Steve Clark

Date:

25 January 2008, 12:24:01

Tom Lane wrote:
> Steve Clark <sclark@netwolves.com> writes:
>
>>explain shows:
>
>
>>  Aggregate  (cost=4712921585.30..4712921585.31 rows=1 width=0)
>>    ->  Seq Scan on t_event_ack_log a  (cost=103170.29..4712920878.60
>>rows=282677 width=0)
>>          Filter: (NOT (subplan))
>>          SubPlan
>>            ->  Materialize  (cost=103170.29..117301.92 rows=1016163
>>width=4)
>>                  ->  Index Scan using pk_tuel_eln on t_unit_event_log
>>  (cost=0.00..98184.12 rows=1016163 width=4)
>
>
> Yeah, that's going to suck.  A brute force solution is to see if you
> can get it to switch to a "hashed subplan" by increasing work_mem.
>
> Also, whatever is the ORDER BY for?
>
>             regards, tom lane
>

without the order by it wants to do a seq scan of t_unit_event_log.
see below:
  explain select count(*) from t_event_ack_log where event_log_no not
in (select event_log_no from t_unit_event_log);
                                        QUERY PLAN
----------------------------------------------------------------------------------------
  Aggregate  (cost=12144872193.82..12144872193.82 rows=1 width=0)
    ->  Seq Scan on t_event_ack_log  (cost=0.00..12144871485.07
rows=283497 width=0)
          Filter: (NOT (subplan))
          SubPlan
            ->  Seq Scan on t_unit_event_log  (cost=0.00..40286.56
rows=1021156 width=4)
(5 rows)


Will try increasing work_memory.

Re: best way to query

From

"Daniel Verite"

Date:

25 January 2008, 12:50:59

    Steve Clark wrote:

> any way i have 2 table - A and B.
> each table has a key field and if a row is in B it should have a
> corresponding row in A - but theres
> the problem it doesn't for all the rows in B.
>
> So I want to do something like
> delete from B where key not in (select key from A order by key);
>
> The problem is there are about 1,000,000 rows in A and 300,000 rows
in
> B. I let the above run
> all night and it was still running the next morning. Does anyone have

> an idea of a better way.

An outer join is sometimes spectacularly more efficient for this
particular kind of query.

I'd suggest you try:

delete from B where key in
 (select B.key from B left outer join A on A.key=B.key
   where A.key is null)

--
 Daniel
 PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

Re: best way to query

From

Tom Lane

Date:

25 January 2008, 13:05:18

Steve Clark <sclark@netwolves.com> writes:
> Tom Lane wrote:
>> Also, whatever is the ORDER BY for?

> without the order by it wants to do a seq scan of t_unit_event_log.
> see below:
>   explain select count(*) from t_event_ack_log where event_log_no not
> in (select event_log_no from t_unit_event_log);
>                                         QUERY PLAN
> ----------------------------------------------------------------------------------------
>   Aggregate  (cost=12144872193.82..12144872193.82 rows=1 width=0)
>     ->  Seq Scan on t_event_ack_log  (cost=0.00..12144871485.07
> rows=283497 width=0)
>           Filter: (NOT (subplan))
>           SubPlan
>             ->  Seq Scan on t_unit_event_log  (cost=0.00..40286.56
> rows=1021156 width=4)
> (5 rows)

Hmm, the big problem with that isn't the seqscan but the lack of a
Materialize step to buffer it; which says to me that you're running a
pretty old Postgres version (8.0 or older).  You should think about an
update if you're running into performance issues.

            regards, tom lane

Re: best way to query

From

Steve Clark

Date:

25 January 2008, 13:15:57

Daniel Verite wrote:
>     Steve Clark wrote:
>
>
>>any way i have 2 table - A and B.
>>each table has a key field and if a row is in B it should have a
>>corresponding row in A - but theres
>>the problem it doesn't for all the rows in B.
>>
>>So I want to do something like
>>delete from B where key not in (select key from A order by key);
>>
>>The problem is there are about 1,000,000 rows in A and 300,000 rows
>
> in
>
>>B. I let the above run
>>all night and it was still running the next morning. Does anyone have
>
>
>>an idea of a better way.
>
>
> An outer join is sometimes spectacularly more efficient for this
> particular kind of query.
>
> I'd suggest you try:
>
> delete from B where key in
>  (select B.key from B left outer join A on A.key=B.key
>    where A.key is null)
>
WOW!

this runs in about 10 seconds - thanks Daniel.

  explain select count(*) from t_event_ack_log  where event_log_no in
(select t_event_ack_log.event_log_no from t_event_ack_log left outer
join t_unit_event_log a on a.event_log_no=t_event_ack_log.event_log_no
  where a.event_log_no is null);
                                                    QUERY PLAN

----------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=128349.56..128349.57 rows=1 width=0)
    ->  Hash Join  (cost=94512.91..126935.36 rows=565681 width=0)
          Hash Cond: (public.t_event_ack_log.event_log_no =
public.t_event_ack_log.event_log_no)
          ->  Seq Scan on t_event_ack_log  (cost=0.00..14759.85
rows=565685 width=4)
          ->  Hash  (cost=92609.85..92609.85 rows=152245 width=4)
                ->  HashAggregate  (cost=91087.40..92609.85
rows=152245 width=4)
                      ->  Hash Left Join  (cost=57337.95..90380.29
rows=282842 width=4)
                            Hash Cond:
(public.t_event_ack_log.event_log_no = a.event_log_no)
                            Filter: (a.event_log_no IS NULL)
                            ->  Seq Scan on t_event_ack_log
(cost=0.00..14759.85 rows=565685 width=4)
                            ->  Hash  (cost=40696.09..40696.09
rows=1014309 width=4)
                                  ->  Seq Scan on t_unit_event_log a
(cost=0.00..40696.09 rows=1014309 width=4)

Re: best way to query

From

Steve Clark

Date:

25 January 2008, 13:19:24

Tom Lane wrote:
> Steve Clark <sclark@netwolves.com> writes:
>
>>Tom Lane wrote:
>>
>>>Also, whatever is the ORDER BY for?
>
>
>>without the order by it wants to do a seq scan of t_unit_event_log.
>>see below:
>>  explain select count(*) from t_event_ack_log where event_log_no not
>>in (select event_log_no from t_unit_event_log);
>>                                        QUERY PLAN
>>----------------------------------------------------------------------------------------
>>  Aggregate  (cost=12144872193.82..12144872193.82 rows=1 width=0)
>>    ->  Seq Scan on t_event_ack_log  (cost=0.00..12144871485.07
>>rows=283497 width=0)
>>          Filter: (NOT (subplan))
>>          SubPlan
>>            ->  Seq Scan on t_unit_event_log  (cost=0.00..40286.56
>>rows=1021156 width=4)
>>(5 rows)
>
>
> Hmm, the big problem with that isn't the seqscan but the lack of a
> Materialize step to buffer it; which says to me that you're running a
> pretty old Postgres version (8.0 or older).  You should think about an
> update if you're running into performance issues.
>
>             regards, tom lane
>
>

Hi Tom,

Actually this is

  show server_version;
  server_version
----------------
  8.2.5


On FreeBSD 6.2

And Daniel gave me a query that does the job in just a few seconds.


Thanks,
Steve