Thread: how to speed up query


From
"Andrus"
Date:
How to speed up the query

delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

which runs for approx. 30 minutes.

I have dokumnr indexes on both tables, and both tables are analyzed.


CREATE TABLE firma1.dok
(
  doktyyp character(1) NOT NULL,
  dokumnr integer NOT NULL DEFAULT nextval('dok_dokumnr_seq'::regclass),
....

CREATE INDEX dok_dokumnr_idx  ON firma1.dok  USING btree  (dokumnr);



CREATE TABLE firma1.rid
(
  id integer NOT NULL DEFAULT nextval('rid_id_seq'::regclass),
  reanr integer NOT NULL DEFAULT nextval('rid_reanr_seq'::regclass),
  dokumnr integer NOT NULL,
....
  CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr)
      REFERENCES firma1.dok (dokumnr) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
..
)

CREATE INDEX rid_dokumnr_idx  ON firma1.rid  USING btree  (dokumnr);


explain delete from firma1.rid where dokumnr not in (select dokumnr from
firma1.dok)  returns

"Seq Scan on rid  (cost=7703.59..99687857.75 rows=102358 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4)"




Andrus.


Re: how to speed up query

From
Andrew Sullivan
Date:
On Fri, Jun 08, 2007 at 11:29:12AM +0300, Andrus wrote:
> How to speed up the query

We don't know.  You don't tell us what version you're running, show
us any EXPLAIN ANALYSE output, tell us about the data. . .

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
        --Scott Morris

Re: how to speed up query

From
Erwin Brandstetter
Date:
Hi Andrus!

On Jun 8, 10:29 am, "Andrus" <kobrule...@hot.ee> wrote:
> How to speed up the query
>
> delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

> CREATE TABLE firma1.dok
> (
>   doktyyp character(1) NOT NULL,
>   dokumnr integer NOT NULL DEFAULT nextval('dok_dokumnr_seq'::regclass),
> ....
>
> CREATE INDEX dok_dokumnr_idx  ON firma1.dok  USING btree  (dokumnr);
>
> CREATE TABLE firma1.rid
> (
>   id integer NOT NULL DEFAULT nextval('rid_id_seq'::regclass),
>   reanr integer NOT NULL DEFAULT nextval('rid_reanr_seq'::regclass),
>   dokumnr integer NOT NULL,
> ....
>   CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr)
>       REFERENCES firma1.dok (dokumnr) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
(...)


This whole operation looks contradictory in several ways.

firma1.rid references firma1.dok on (dokumnr)
Therefore, referential integrity commands that there be NO rows in
firma1.rid with a dokumnr not present in firma1.dok.
Therefore your DELETE cannot possibly be deleting anything. It is
nonsensical:
   delete from firma1.rid where dokumnr not in (select dokumnr from
firma1.dok)

Did you mean:
   delete from firma1.dok where dokumnr not in (select dokumnr from
firma1.rid)
??

The next weird thing:
I see no UNIQUE index (or primary key) ON firma1.dok.dokumnr. As it is
being referenced by a foreign key constraint from firma1.rid, the system
would require one.
This index makes no sense at all:
   CREATE INDEX dok_dokumnr_idx  ON firma1.dok  USING btree
(dokumnr);


Either your problem description is messed up or your postgres
installation is. My money is on the former.


Aside from that, my ideas would be (assuming that you got the
statement backwards):
1.) Try to make the foreign key DEFERRABLE INITIALLY DEFERRED (instead
of INITIALLY IMMEDIATE), because every delete on firma1.dok CASCADES
to firma1.rid.

2.) Add a DISTINCT clause:
   delete from firma1.dok where dokumnr not in (select DISTINCT
dokumnr from firma1.rid)


3.) Write results of the subquery in a temp table, then DELETE:

CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr  FROM firma1.rid;
DELETE FROM firma1.dok USING mydel WHERE firma1.rid.doumnr =
mydel.doumnr;

If these things do not solve your problem, it might still be helpful
to tell us what they do.


Regards
Erwin


Re: how to speed up query

From
Erwin Brandstetter
Date:
On Jun 9, 12:15 am, Erwin Brandstetter <brsaw...@gmail.com> wrote:
> 3.) Write results of the subquery in a temp table, then DELETE:
>
> CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr  FROM firma1.rid;
> DELETE FROM firma1.dok USING mydel WHERE firma1.rid.doumnr =
> mydel.doumnr;

Ah! 3.) should read:
CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr  FROM firma1.rid;
DELETE FROM firma1.dok WHERE dokumnr NOT IN (SELECT dokumnr FROM
mydel);

Or 4.)
If "NOT IN" should be the culprit, there is an alternative:
( I seem to remember issues with its performance in the past, but
hasn't that been improved? Not sure.)
Haven't tested whether the temp table is useful here:

CREATE TEMP TABLE mydel AS
SELECT d.dokumnr
FROM firma1.dok d
LEFT JOIN (SELECT DISTINCT dokumnr FROM firma1.rid) r USING (dokumnr)
WHERE r.dokumnr IS NULL;
DELETE FROM firma1.dok USING mydel WHERE firma1.dok.dokumnr =
mydel.dokumnr;


Regards
Erwin


Re: how to speed up query

From
Erwin Brandstetter
Date:
On Jun 11, 2:01 pm, "Andrus" <kobrule...@hot.ee> wrote:
(...)
> > This index makes no sense at all:
> > CREATE INDEX dok_dokumnr_idx  ON firma1.dok  USING btree (dokumnr);
>
> I listed table structure and constraints partially.
> There is also a primary key constraint in the dok table:
>
>  CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),


On a sidenote: this primary key implements a unique index anyway. The
additional index is useless. You can delete it to save time and
storage.
(Or maybe this is just another discrepancy between reality and problem
description.)


Regards
Erwin


Re: how to speed up query

From
Erwin Brandstetter
Date:
On Jun 11, 2:23 pm, "Andrus" <kobrule...@hot.ee> wrote:
> I tried
>
> CREATE TEMP TABLE mydel AS
>  SELECT r.dokumnr
>  FROM rid r
>  LEFT JOIN dok d USING (dokumnr)
>  WHERE d.dokumnr IS NULL;
> DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
> drop table mydel;
>
> and this runs in 1 second instead of 2.2 hours.
>
> Thank you very much.
> This works!
>
> It's sad that PostgreSQL cannot optimize this delete statement
> automatically.


1 second does sound a lot better than 2 hours, doesn't it? :)
As to why Postgres seems to fail, I cannot say any more, as your
description is unclear. I am pretty sure there is some
misunderstanding, though.


Regards
Erwin


Re: how to speed up query

From
Erwin Brandstetter
Date:
Hi Andrus!

On Jun 12, 6:38 pm, "Andrus" <kobrule...@hot.ee> wrote:
> 1 second is for repeated runs from pgAdmin.
> In my script the same CREATE TEMP TABLE command takes approx 11 minutes for
> the same data (see log below).

I cannot make much sense of this information. I can see no reason why
your script should take 11 minutes, while executing it from pgAdmin
would take only a second. How do you run the script?


> After your suggested change my database creation script runs 6 hours.

Is that down from the 14 hours you mentioned before? Which would be an
amazing 8 hours faster?


(...)
> I used query
>
> SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
>  FROM pg_class
>  where  relpages * 8/1024>0
>  ORDER BY relpages DESC

Looks like a useful query. Compare with:
SELECT pg_size_pretty(pg_database_size('bilkaib'))
SELECT pg_size_pretty(pg_relation_size('bilkaib'))
SELECT pg_size_pretty(pg_total_relation_size('bilkaib'))
See  http://www.postgresql.org/docs/8.2/interactive/functions-admin.html

> Biggest database (bilkaib) load time is 8 minutes, it contains 329000
> records.
> Total data loading time is approx 49 minutes.

You mean table, not database?


> Remaining 5 hours are used for index and key creation. This seems too much.
>
> Here is the log file for minutes 49..135, i.e. the first 86 minutes after
> loading the data.
>
> It shows statements which ran more than 1 minute.
>
> The first number (49,4500) is the number of minutes from the start of the
> script (starting from database creation).
>
> The slowest statement is
>
> CREATE TEMP TABLE mydel AS
>  SELECT r.dokumnr
>  FROM rid r
>  LEFT JOIN dok d USING (dokumnr)
>  WHERE d.dokumnr IS NULL

It might be worth checking the order in which you create objects.
Creating relevant indices before using complex queries is one thing to
look for.
If that still runs so slow, it's probably an indication that your RDBMS is
in dire need of more RAM. Look at your setup in postgresql.conf.

As everything runs slow, you should look to your hardware, system
configuration and PostgreSQL setup. Do you have enough RAM (you
mentioned 2 GB) and does PostgreSQL get its share? (-> setup in
postgresql.conf). There is probably a bottleneck somewhere.

I have a machine just like the one you described above ( dual AMD
Opteron 240, 2 GB RAM, 10k rpm HDDs in RAID 1) - slow CPUs and
conservative RAID setup - and a complete dump AND restore of a
database where SELECT pg_size_pretty(pg_database_size('event')) = 300
MB takes about 1,5 minutes. And this is with pg 8.1.8, so your setup
should have better hardware (guessing here) and newer software. But
your DB is also a lot bigger.

Anyway, I am not a postgres hacker, I am just a DB admin myself, so
don't expect too much from me. Someone else might know more. But if
you want help, you'd better learn how to present a problem in a way
that deserves help.


My best guess: buy at least 2 GB more RAM. Look to your settings in
postgresql.conf. Read up here:
http://www.postgresql.org/docs/current/static/performance-tips.html
http://revsys.com/writings/postgresql-performance.html
http://www.powerpostgresql.com/Docs
http://www.powerpostgresql.com/Downloads/annotated_conf_80.html

If that does not solve your problem, post your setup or your script -
whichever you suspect to be the problem - and try to present all the
necessary information in a concise manner. That is much more likely to
get help. Nobody wants to waste time, especially not helping someone
free of charge. Your first posting was just not good enough. If you
keep mixing things up, people will be frustrated and rather not help.
Read your posting, before you send it.


Regards
Erwin


Re: how to speed up query

From
"Andrus"
Date:
Andrew,

>> How to speed up the query
>
> We don't know.

Thank you.

EXPLAIN seems to show that PostgreSQL makes a sequential scan of the whole dok
table for every rid table row.
This is very slow since dok contains 55963 rows and rid 202421 rows.
I expected that there exists some trick to force this DELETE command to
use bitmaps by re-writing it using joins.

> You don't tell us what version you're running

"PostgreSQL 8.2.3 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)"

> , show us any EXPLAIN ANALYSE output

I tried

explain analyze delete from firma1.rid where dokumnr not in (select dokumnr
from firma1.dok)

but after 40 minutes this command had still not finished.

I will leave this query running over the weekend.


> , tell us about the data. . .

I'm creating a PostgreSQL database and loading data into it from my
application.
I use this type of delete command to remove bad rows before adding foreign
keys.
My loading application has been running for 14 hours and is not finished.
Server and application are on the same fast computer with 2 GB RAM and a
10000 RPM WD Raptor HDD running Windows Vista.
The postgresql.conf file is not modified; it is from the PostgreSQL Windows
installer.
After 14 hours CPU usage is 50% (the postgres process uses it all), with no
disk activity.

It seems that PostgreSQL cannot use 100% of the CPU on dual core processors
when creating a new database and loading data using a single connection. It
can use only 50% of the CPU.
Maybe two connections running two child processes could use 100% of the CPU,
but I have no idea how to split database loading and foreign key creation
into two connections.

Andrus.


Re: how to speed up query

From
"Andrus"
Date:
> This whole operation looks contradictory in several ways.
>
> firma1.rid references firma1.dok on (dokumnr)
> Therefore, referential integrity commands that there be NO rows in
> firma1.rid with a dokumnr not present in firma1.dok.
> Therefore your DELETE cannot possibly be deleting anything. It is
> nonsensical:
>   delete from firma1.rid where dokumnr not in (select dokumnr from
> firma1.dok)

Yes, it is nonsensical. However, this command should run fast even if it is
nonsensical.

In my application I add the foreign key after running this delete command.
I listed the table structure after adding it; I'm sorry.

I tried the following command

alter table firma1.rid drop constraint rid_dokumnr_fkey;
set constraints all deferred;
explain analyze delete from firma1.rid where dokumnr not in (select dokumnr
from firma1.dok)

but it still produces plan

"Seq Scan on rid  (cost=7703.59..98570208.00 rows=101210 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4)"


> Did you mean:
>   delete from firma1.dok where dokumnr not in (select dokumnr from
> firma1.rid)
> ??

No. I mean

 delete from firma1.rid where dokumnr not in (select dokumnr from
 firma1.dok)

> The next weird thing:
> I see no UNIQUE index (or primary key) ON firma1.dok.dokumnr. As it is
> being referenced by foreign key constraint from firma1.rid, the system
> would require that.
> This index makes no sense at all:
>   CREATE INDEX dok_dokumnr_idx  ON firma1.dok  USING btree
> (dokumnr);

I listed the table structure and constraints only partially.
There is also a primary key constraint in the dok table:

 CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),

> Either your problem description is messed up or your postgres
> installation is. My money is on the former.
>
>
> Aside from that, my ideas would be (assuming that you got the
> statement backwards):
> 1.) Try to make the foreign key DEFERRABLE INITIALLY DEFERRED (instead
> of INITIALLY IMMEDIATE), because every delete on firma1.dok CASCADES
> to firma1.rid.

I delete from the firma1.rid table.
I dropped the foreign key using

alter table firma1.rid drop constraint rid_dokumnr_fkey;

but the problem persists.

> 2.) Add a DISTINCT clause:
>   delete from firma1.dok where dokumnr not in (select DISTINCT
> dokumnr from firma1.rid)

I tried
   delete from firma1.rid  where dokumnr not in (select DISTINCT
 dokumnr from firma1.dok)

but this still runs a very long time.

output from explain:

"Seq Scan on rid  (cost=20569.69..98583074.10 rows=101210 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=20569.69..21403.32 rows=55963 width=4)"
"          ->  Unique  (cost=0.00..20239.73 rows=55963 width=4)"
"                ->  Index Scan using dok_dokumnr_idx on dok
(cost=0.00..20099.82 rows=55963 width=4)"


Andrus.


Re: how to speed up query

From
"Andrus"
Date:
>> I tried
>>
>> CREATE TEMP TABLE mydel AS
>>  SELECT r.dokumnr
>>  FROM rid r
>>  LEFT JOIN dok d USING (dokumnr)
>>  WHERE d.dokumnr IS NULL;
>> DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
>> drop table mydel;
>>
>> and this runs in 1 second instead of 2.2 hours.
>>
>> Thank you very much.
>> This works!
>>
>> It's sad that PostgreSQL cannot optimize this delete statement
>> automatically.
>
>
> 1 second does sound a lot better than 2 hours, doesn't it? :)

1 second is for repeated runs from pgAdmin.
In my script the same CREATE TEMP TABLE command takes approx 11 minutes for
the same data (see log below).

> As to why Postgres seems to fail, I cannot say any more, as your
> description is unclear. I am pretty sure there is some
> misunderstanding, though.

After your suggested change my database creation script runs 6 hours.

The resulting database's 15 biggest tables are:

      1  bilkaib  152MB
      2  omrid  146MB
      3  klient  130MB
      4  rid 120MB
      5  omdok 59MB
      6  dok  48MB
      7  mailbox 28MB
      8  report  19MB
      9  bilkaib_db_idx 16MB
     10  bilkaib_cr_idx 16MB
     11  bilkaib_pkey 14MB
     12  bilkaib_kuupaev_idx  13MB
     13  bilkaib_dokumnr_idx 11MB
     14  summav  9MB
     15  desktop 7MB

I used query

SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
 FROM pg_class
 where  relpages * 8/1024>0
 ORDER BY relpages DESC

for this.

Biggest database (bilkaib) load time is 8 minutes, it contains 329000
records.
Total data loading time is approx 49 minutes.

Remaining 5 hours are used for index and key creation. This seems too much.

Here is the log file for minutes 49..135, i.e. the first 86 minutes after
loading the data.

It shows statements which ran more than 1 minute.

The first number (49,4500) is the number of minutes from the start of the
script (starting from database creation).

The slowest statement is

CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

which starts at minute 104 and has a duration of 11 minutes.

There seems to be no direct bottleneck: there are many commands with
durations of 1..11 minutes.
I will run my script tonight to get complete timings.

Any idea how to increase the speed?

Andrus.

49,4500 Duration 1,4167 minutes: UPDATE dok SET krdokumnr=NULL WHERE
krDokumnr is NOT null AND doktyyp NOT IN ('G','O')

52,3167 Duration 2,8667 minutes:
UPDATE dok SET krdokumnr=NULL WHERE doktyyp='G' AND krdokumnr IS NOT NULL
and
   krdokumnr NOT in (select dokumnr from dok WHERE doktyyp='G')

55,1500 Duration 2,8333 minutes:
UPDATE dok SET krdokumnr=NULL WHERE doktyyp='O' AND krdokumnr IS NOT NULL
and
   krdokumnr NOT in (select dokumnr from dok WHERE doktyyp='O')

56,5667 Duration 1,4167 minutes:
ALTER TABLE dok ADD CHECK (krdokumnr IS NULL OR doktyyp IN('G','O'))

57,9833 Duration 1,4167 minutes:
ALTER TABLE dok ADD CHECK (dokumnr>0)

60,8333 Duration 2,8333 minutes:
ALTER TABLE dok ADD FOREIGN KEY (krdokumnr) REFERENCES dok
  ON DELETE SET NULL ON UPDATE CASCADE DEFERRABLE

62,2667 Duration 1,4333 minutes:
ALTER TABLE dok ALTER doktyyp SET NOT NULL

63,6833 Duration 1,4167 minutes:
ALTER TABLE dok ALTER kuupaev SET NOT NULL

65,1333 Duration 1,4500 minutes:

CREATE INDEX dok_kuupaev_idx ON dok (kuupaev)

66,5667 Duration 1,4333 minutes:
CREATE INDEX dok_krdokumnr_idx ON dok (krdokumnr)

68 Duration 1,4333 minutes:
CREATE INDEX dok_tellimus_idx ON dok (tellimus)

69,4333 Duration 1,4333 minutes:
CREATE INDEX dok_tasudok_idx ON dok (tasudok)

70,8833 Duration 1,4333 minutes:
CREATE INDEX dok_klient_idx ON dok (klient)

72,3167 Duration 1,4333 minutes:
CREATE INDEX dok_tasumata_idx ON dok (tasumata)

73,7500 Duration 1,4333 minutes:

CREATE UNIQUE INDEX dok_tasudok_unique_idx ON dok (doktyyp,tasudok)
    WHERE doktyyp IN ( 'T', 'U')

83,5000 Duration 9,7500 minutes:
CREATE INDEX rid_dokumnr_idx ON rid (dokumnr)

93,2500 Duration 9,7500 minutes:
CREATE INDEX rid_toode_idx ON rid (toode)

104,3500 Duration 11,1000 minutes: CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

114,0167 Duration 9,6500 minutes:
DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr

125,1500 Duration 11,1333 minutes:

 ALTER TABLE rid ADD FOREIGN KEY (dokumnr) REFERENCES dok
  ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE

135,0833 Duration 9,9333 minutes:
ALTER TABLE rid ALTER dokumnr SET NOT NULL








Re: how to speed up query

From
"Andrus"
Date:
>> There is also a primary key constraint in the dok table:
>>
>>  CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),
>
>
> On a sidenote: this primary key implements a unique index anyway. The
> additional index is useless. You can delete it to save time and
> storage.
> (Or maybe this is just another discrepancy between reality and problem
> description.)

Erwin,

thank you.
This index was duplicated.
I changed my script not to create this index.
However, this should not affect the speed much, since this index was created
after the data was loaded into the table.

Andrus.


Re: how to speed up query

From
"Andrus"
Date:
> Ah! 3.) should read:
> CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr  FROM firma1.rid;
> DELETE FROM firma1.dok WHERE dokumnr NOT IN (SELECT dokumnr FROM
> mydel);

I need to delete from the firma1.rid table,
so I cannot use this suggestion, since firma1.dok.dokumnr is already unique
(it's the primary key).

> Or 4.)
> If "NOT IN" should be the culprit, there is an alternative:
> ( I seem to remember issues with its performance in the past, but
> hasn't that been improved? Not sure.)
> Haven't tested, whether the temp table is useful here:
>
> CREATE TEMP TABLE mydel AS
> SELECT d.dokumnr
> FROM firma1.dok d
> LEFT JOIN (SELECT DISTINCT dokumnr FROM firma1.rid) r USING (dokumnr)
> WHERE r.dokumnr IS NULL;
> DELETE FROM firma1.dok USING mydel WHERE firma1.dok.dokumnr =
> mydel.dokumnr;

I tried

CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL;
DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
drop table mydel;

and this runs in 1 second instead of 2.2 hours.

Thank you very much.
This works!

It's sad that PostgreSQL cannot optimize this delete statement
automatically.

Andrus.


Re: how to speed up query

From
"Andrus"
Date:
> We don't know.  You don't tell us what version you're running, show
> us any EXPLAIN ANALYSE output, tell us about the data. . .

explain analyze delete from firma1.rid where dokumnr not in (select dokumnr
from firma1.dok)

produces

"Seq Scan on rid  (cost=7703.59..98570208.00 rows=101210 width=6) (actual
time=7269933.877..7269933.877 rows=0 loops=1)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4) (actual
time=0.007..18.707 rows=25313 loops=202421)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4)
(actual time=40.236..3353.985 rows=56079 loops=1)"
"Total runtime: 7269944.251 ms"

As I understand it, this took 2.2 hours to run.

Andrus.


Re: how to speed up query

From
Martijn van Oosterhout
Date:
On Mon, Jun 11, 2007 at 03:01:08PM +0300, Andrus wrote:
> >  delete from firma1.rid where dokumnr not in (select dokumnr from
> >firma1.dok)
>
> Yes, it is nonsensical. However, this command should run fast even if it is
> nonsensical.

For future reference, I believe the problem is the NOT IN. It has this
"feature" where if any of the rows it searches has a NULL, it will
return FALSE for *all* rows. So the whole table has to be scanned to
check that there aren't any NULLs, before a single row can be returned.
This is why it can't be converted to a join.

Now, you may argue that in your case this doesn't apply, which may be
true, but it's always been a difficult construct to optimise... (and
somewhat surprising for people who didn't realise the
null-effect). The most efficient way to write this is with an OUTER
JOIN.
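The NULL "feature" described above can be sketched with a miniature example (Python with SQLite, whose NOT IN follows the same three-valued logic; the tables are hypothetical stand-ins, not the firma1 schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rid (dokumnr INTEGER);
    CREATE TABLE dok (dokumnr INTEGER);
    INSERT INTO rid VALUES (1), (2), (99);   -- 99 has no match in dok
    INSERT INTO dok VALUES (1), (2), (NULL); -- a single NULL poisons NOT IN
""")

# Three-valued logic: 99 NOT IN (1, 2, NULL) evaluates to NULL, not TRUE,
# so the filter rejects every row.
not_in = con.execute(
    "SELECT dokumnr FROM rid WHERE dokumnr NOT IN (SELECT dokumnr FROM dok)"
).fetchall()
print(not_in)  # []

# The LEFT JOIN anti-join is immune to the NULL and finds the orphan row.
anti_join = con.execute("""
    SELECT r.dokumnr FROM rid r
    LEFT JOIN dok d ON r.dokumnr = d.dokumnr
    WHERE d.dokumnr IS NULL
""").fetchall()
print(anti_join)  # [(99,)]
```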

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: how to speed up query

From
"Andrus"
Date:
> >  delete from firma1.rid where dokumnr not in (select dokumnr from
> >firma1.dok)

>For future reference, I believe the problem is the NOT IN. It has this
>"feature" where if any of the rows it searches has a NULL, it will
>return FALSE for *all* rows. So the whole table has to be scanned to
>check that there aren't any NULLs, before a single row can be returned.
>This is why it can't be converted to a join.

Thank you.
As I understand it, the only way to optimize the statement

delete from firma1.rid where dokumnr not in (select dokumnr from
firma1.dok);

assuming that firma1.dok.dokumnr does not contain null values, is to change
it to

CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL;
DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
drop table mydel;
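The three-step rewrite above can be sketched end to end (Python with SQLite standing in for PostgreSQL; SQLite has no DELETE ... USING, so the final delete uses an IN subquery against the temp table instead; the data is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dok (dokumnr INTEGER PRIMARY KEY);
    CREATE TABLE rid (dokumnr INTEGER NOT NULL);
    INSERT INTO dok VALUES (1), (2);
    INSERT INTO rid VALUES (1), (2), (7), (7);  -- two orphan rows

    -- Step 1: anti-join collects the keys present in rid but missing from dok.
    CREATE TEMP TABLE mydel AS
        SELECT r.dokumnr
        FROM rid r
        LEFT JOIN dok d ON r.dokumnr = d.dokumnr
        WHERE d.dokumnr IS NULL;

    -- Step 2: delete by key; step 3: drop the scratch table.
    DELETE FROM rid WHERE dokumnr IN (SELECT dokumnr FROM mydel);
    DROP TABLE mydel;
""")

remaining = [r[0] for r in con.execute("SELECT dokumnr FROM rid ORDER BY dokumnr")]
print(remaining)  # [1, 2] -- both orphan rows are gone
```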


I ran the following commands (the first number is minutes from script start)
in my script:

18 Duration 2,9 minutes: ALTER TABLE dok ADD PRIMARY KEY (dokumnr)
...
81 Duration 9,6 minutes:  CREATE INDEX rid_dokumnr_idx ON rid (dokumnr)
...
101 Duration 10,5 minutes:  analyze
...
113  Duration 11 minutes: CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

122 Duration 9,6 minutes:  DELETE FROM rid USING mydel WHERE rid.dokumnr
=mydel.dokumnr

133 Duration 11 minutes:  ALTER TABLE rid ADD FOREIGN KEY (dokumnr)
REFERENCES dok
  ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE

When I run command

CREATE TEMP TABLE mydel AS
 SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

from pgAdmin, it takes 1 second.
When I run this command from the script it takes 11 minutes!

Any idea why running this command from the script takes 11 minutes?

I have created indexes on dok and rid and ran ANALYZE before using CREATE
TEMP TABLE, so I expect that the CREATE TEMP TABLE command should take the
same time to run from the script and from pgAdmin.

My script is running in a single transaction.
Should I use a commit after index creation or after the ANALYZE command?


In pgAdmin

explain analyze  SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

returns

"Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual
time=2520.904..2520.904 rows=0 loops=1)"
"  Hash Cond: (r.dokumnr = d.dokumnr)"
"  Filter: (d.dokumnr IS NULL)"
"  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4) (actual
time=0.032..352.225 rows=202421 loops=1)"
"  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual
time=211.150..211.150 rows=56079 loops=1)"
"        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4)
(actual time=0.021..147.805 rows=56079 loops=1)"
"Total runtime: 2521.091 ms"

Andrus.


Re: how to speed up query

From
Tom Lane
Date:
"Andrus" <kobruleht2@hot.ee> writes:
> from pgAdmin, it takes 1 second.
> When I run this command from script it takes 11 minutes!

> Any idea why running this command from script takes 11 minutes?

Different plans maybe?  Try EXPLAIN ANALYZE in both cases.
Do you have work_mem set the same in both cases?

> My script is running in a single transaction.
> Should I use a commit after index creation or after the ANALYZE command?

Hmm, there are some extra cycles involved in examining not-yet-committed
tuples, but I hardly see how that would create a discrepancy of this
size.  Check the plans first.

            regards, tom lane

Re: how to speed up query

From
"Andrus"
Date:
> I cannot make much sense of this information. I can see no reason why
> your script should take 11 minutes, while executing it from pgAdmin
> would take only a second. How do you run the script?

I'm running my script from a VFP client application.
The application sends every statement to the server separately using the
ODBC driver.

Table creation, data loading, primary key creation, index creation, ANALYZE
and the problematic CREATE TEMP TABLE command all run in a single
transaction.
Should I commit the transaction after the ANALYZE command or after index
creation?

The server log shows:

2007-06-13 03:19:43 LOG:  checkpoints are occurring too frequently (21
seconds apart)
2007-06-13 03:19:43 HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
2007-06-13 03:20:02 LOG:  checkpoints are occurring too frequently (19
seconds apart)
2007-06-13 03:20:02 HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
2007-06-13 03:20:22 LOG:  checkpoints are occurring too frequently (20
seconds apart)
2007-06-13 03:20:22 HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
2007-06-13 03:21:18 LOG:  checkpoints are occurring too frequently (23
seconds apart)
2007-06-13 03:21:18 HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
2007-06-13 03:49:10 ERROR:  deadlock detected
2007-06-13 03:49:10 DETAIL:  Process 3280 waits for AccessExclusiveLock on
relation 233893 of database 233756; blocked by process 2508.
 Process 2508 waits for ShareUpdateExclusiveLock on relation 233988 of
database 233756; blocked by process 3280.
2007-06-13 03:49:10 STATEMENT:  ALTER TABLE desktop ADD FOREIGN KEY
 (alamklass) REFERENCES andmetp ON UPDATE CASCADE      DEFERRABLE

and the script terminated after running 5.5 hours yesterday night.
I will re-start the computer and try again.

Can increasing checkpoint_segments increase speed significantly?

>> After your suggested change my database creation script runs 6 hours.
>
> Is that down from the 14 hours you mentioned before? Which would be an
> amazing 8 hours faster?

I had a number of DELETE .. WHERE NOT IN commands.
I changed them all to CREATE TEMP TABLE ... DELETE.

>> I used query
>>
>> SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
>>  FROM pg_class
>>  where  relpages * 8/1024>0
>>  ORDER BY relpages DESC
>
> Looks like a useful query. Compare with:
> SELECT pg_size_pretty(pg_database_size('bilkaib'))

SELECT pg_size_pretty(pg_database_size('mydb'))

returns

828 MB


> SELECT pg_size_pretty(pg_relation_size('bilkaib'))

returns 100 MB

> SELECT pg_size_pretty(pg_total_relation_size('bilkaib'))

returns "171 MB"

relpages * 8/1024 and pg_relation_size(oid) return very different results in
some cases; I have no idea why.

For one index relpages gives a size of about 6 MB, but pg_relation_size
returns only 2152 kB.


>> Biggest database (bilkaib) load time is 8 minutes, it contains 329000
>> records.
>> Total data loading time is approx 49 minutes.
>
> You mean table, not database?

Yes, I meant table.

>> Remaining 5 hours are used for index and key creation. This seems too
>> much.
> It might be worth checking the order in which you create objects.
> Creating relevant indices before using complex queries is one thing to
> look for.

I created the primary key on dok(dokumnr) and the index on rid(dokumnr), and
ran ANALYZE before running this CREATE TEMP TABLE command.

> If that still runs so slow it's probably indication that your RDBMS is
> in dire need of more RAM. Look to your setup in postgresql.conf.
> As everything runs slow, you should look to your hardware, system
> configuration and PostgreSQL setup. Do you have enough RAM (you
> mentioned 2 GB) and does PostgreSQL get its share? (-> setup in
> postgresql.conf). There is probably a bottleneck somewhere.

If CREATE TEMP TABLE from pgAdmin takes 1 second and from the script 11
minutes, I don't think this is hardware related.

> If that does not solve your problem, post your setup or your script -
> whichever you suspect to be the problem

The script which creates the 800 MB database is big.
I can provide this script, but is anybody interested in looking into it?

Andrus.


Re: how to speed up query

From
Tom Lane
Date:
"Andrus" <kobruleht2@hot.ee> writes:
> 2007-06-13 03:49:10 ERROR:  deadlock detected
> 2007-06-13 03:49:10 DETAIL:  Process 3280 waits for AccessExclusiveLock on
> relation 233893 of database 233756; blocked by process 2508.
>  Process 2508 waits for ShareUpdateExclusiveLock on relation 233988 of
> database 233756; blocked by process 3280.
> 2007-06-13 03:49:10 STATEMENT:  ALTER TABLE desktop ADD FOREIGN KEY
>  (alamklass) REFERENCES andmetp ON UPDATE CASCADE      DEFERRABLE

> and script terminates after 5.5 hours running yesterday night.

This might be a good reason not to run the script as a single long
transaction --- it's probably accumulating locks on a lot of different
tables.  Which would be fine if it was the only thing going on, but
evidently it isn't.

            regards, tom lane

Re: how to speed up query

From
"Andrus"
Date:
>> and script terminates after 5.5 hours running yesterday night.
>
> This might be a good reason not to run the script as a single long
> transaction --- it's probably accumulating locks on a lot of different
> tables.  Which would be fine if it was the only thing going on, but
> evidently it isn't.

Thank you.
I removed the transaction from the script altogether.
I send every statement separately to Postgres,
so each statement runs in its own transaction now.

I hope that this does not decrease speed and that this is the best solution?

Now

 explain analyze  SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

returs the same time from script and when returned separately in small
database.


Andrus.


Re: how to speed up query

From
"Andrus"
Date:
>> from pgAdmin, it takes 1 second.
>> When I run this command from script it takes 11 minutes!
>
>> Any idea why running this command from script takes 11 minutes?
>
> Different plans maybe?  Try EXPLAIN ANALYZE in both cases.

Thank you. I tried

 explain analyze  SELECT r.dokumnr FROM rid r  LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

with a small database.

Run from the script, this command returns

Hash Left Join  (cost=12.11..60.42 rows=1 width=4) (actual time=105.473..105.473 rows=0 loops=1)
  Hash Cond: (r.dokumnr = d.dokumnr)
  Filter: (d.dokumnr IS NULL)
  ->  Seq Scan on rid r  (cost=0.00..38.87 rows=687 width=4) (actual time=2.144..90.823 rows=687 loops=1)
  ->  Hash  (cost=10.38..10.38 rows=138 width=4) (actual time=13.925..13.925 rows=138 loops=1)
        ->  Seq Scan on dok d  (cost=0.00..10.38 rows=138 width=4) (actual time=1.715..13.812 rows=138 loops=1)
Total runtime: 105.542 ms

Running it standalone returns

Hash Left Join  (cost=13.44..61.76 rows=1 width=4) (actual time=2.172..2.172 rows=0 loops=1)
  Hash Cond: (r.dokumnr = d.dokumnr)
  Filter: (d.dokumnr IS NULL)
  ->  Seq Scan on rid r  (cost=0.00..38.87 rows=687 width=4) (actual time=0.076..0.802 rows=687 loops=1)
  ->  Hash  (cost=11.53..11.53 rows=153 width=4) (actual time=0.400..0.400 rows=138 loops=1)
        ->  Seq Scan on dok d  (cost=0.00..11.53 rows=153 width=4) (actual time=0.013..0.242 rows=138 loops=1)
Total runtime: 2.338 ms

I have no idea why this command runs 50 times slower from the script.

The ODBC driver inserts RELEASE SAVEPOINT and SAVEPOINT commands before every
statement. There is a great explanation about this in
http://archives.postgresql.org/pgsql-odbc/2006-05/msg00078.php
Unfortunately, no connection string option is documented.
I haven't found a way to disable this automatic SAVEPOINT insertion from the
ODBC connection string.
I haven't got a reply to my message from January 18, 2007 in the odbc forum
(I posted again today).

Reading the ODBC driver source, I expected that Protocol=-0 in the connection
string would work, but it does not. Probably I missed something in the C
source.

However, I think this cannot slow down SELECT command speed.


> Do you have work_mem set the same in both cases?

Yes. I have the same database server and the same database.

Andrus.


Re: how to speed up query

From
Erwin Brandstetter
Date:
On Jun 13, 3:13 pm, "Andrus" <kobrule...@hot.ee> wrote:
(...)
> As I understand, only way to optimize the statement
>
> delete from firma1.rid where dokumnr not in (select dokumnr from
> firma1.dok);
>
> assuming that  firma1.dok.dokumnr does not contain null values is to change
> it to
>
> CREATE TEMP TABLE mydel AS
>  SELECT r.dokumnr
>  FROM rid r
>  LEFT JOIN dok d USING (dokumnr)
>  WHERE d.dokumnr IS NULL;
> DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
> drop table mydel;

As I mentioned when I proposed it, the temp table may not even be
necessary. The important part is the LEFT JOIN instead of the NOT IN
(as Martijn has explained).
You could try the direct approach ...

DELETE FROM rid
USING ( SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL) x
WHERE rid.dokumnr = x.dokumnr;

 ... and see which runs faster. Probably it does not make much of a
difference.
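For what it's worth, the equivalence Erwin describes can be sketched with a
throwaway script (sqlite3 here only because it is easy to run; the mini
dok/rid tables are made up, not the real firma1 schema):

```python
import sqlite3

# Tiny stand-ins for dok/rid: the LEFT JOIN anti-join finds exactly the
# same orphan rows as NOT IN, provided dok.dokumnr contains no NULLs.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dok (dokumnr INTEGER PRIMARY KEY);
    CREATE TABLE rid (id INTEGER PRIMARY KEY, dokumnr INTEGER NOT NULL);
    INSERT INTO dok VALUES (1), (2);
    INSERT INTO rid VALUES (10, 1), (11, 2), (12, 3);  -- dokumnr 3 is an orphan
""")

anti_join = con.execute("""
    SELECT r.id FROM rid r
    LEFT JOIN dok d ON d.dokumnr = r.dokumnr
    WHERE d.dokumnr IS NULL
""").fetchall()

not_in = con.execute("""
    SELECT id FROM rid
    WHERE dokumnr NOT IN (SELECT dokumnr FROM dok)
""").fetchall()

print(anti_join, not_in)  # [(12,)] [(12,)] -- same orphan either way

# Delete the orphans via the anti-join form.
con.execute("""
    DELETE FROM rid WHERE id IN (
        SELECT r.id FROM rid r
        LEFT JOIN dok d ON d.dokumnr = r.dokumnr
        WHERE d.dokumnr IS NULL)
""")
print(con.execute("SELECT id FROM rid ORDER BY id").fetchall())  # [(10,), (11,)]
```

Note the caveat from earlier in the thread: if dok.dokumnr could contain a
NULL, NOT IN would return no rows at all under SQL's three-valued logic,
while the LEFT JOIN form would still find the orphans.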

If the temp table works for you, you might be interested in a new
feature of 8.2: CREATE TEMP TABLE AS ...ON COMMIT DROP;
http://www.postgresql.org/docs/current/static/sql-createtableas.html

(...)

> explain analyze  SELECT r.dokumnr
>  FROM rid r
>  LEFT JOIN dok d USING (dokumnr)
>  WHERE d.dokumnr IS NULL
>
> returns
>
> "Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual
> time=2520.904..2520.904 rows=0 loops=1)"
> "  Hash Cond: (r.dokumnr = d.dokumnr)"
> "  Filter: (d.dokumnr IS NULL)"
> "  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4) (actual
> time=0.032..352.225 rows=202421 loops=1)"
> "  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual
> time=211.150..211.150 rows=56079 loops=1)"
> "        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4)
> (actual time=0.021..147.805 rows=56079 loops=1)"
> "Total runtime: 2521.091 ms"

If the indices are present (and visible) at the time of execution, as
you described it, we should be seeing index scans on dok_dokumnr_idx
and rid_dokumnr_idx instead of sequential scans.

That's what I get on a similar query in one of my databases:
EXPLAIN ANALYZE SELECT a.adr_id FROM cp.adr a LEFT JOIN cp.kontakt k
USING (adr_id) WHERE k.adr_id IS NULL;

Merge Left Join  (cost=0.00..1356.31 rows=10261 width=4) (actual time=0.096..56.759 rows=3868 loops=1)
  Merge Cond: ("outer".adr_id = "inner".adr_id)
  Filter: ("inner".adr_id IS NULL)
  ->  Index Scan using adr_pkey on adr a  (cost=0.00..947.54 rows=10261 width=4) (actual time=0.012..23.118 rows=10261 loops=1)
  ->  Index Scan using kontakt_adr_id_idx on kontakt k  (cost=0.00..295.47 rows=7011 width=4) (actual time=0.007..13.299 rows=7011 loops=1)
Total runtime: 58.510 ms


Regards
Erwin


Re: how to speed up query

From
"Andrus"
Date:
>> CREATE TEMP TABLE mydel AS
>>  SELECT r.dokumnr
>>  FROM rid r
>>  LEFT JOIN dok d USING (dokumnr)
>>  WHERE d.dokumnr IS NULL;
>> DELETE FROM rid USING mydel WHERE rid.dokumnr =mydel.dokumnr;
>> drop table mydel;
>
> As I mentioned when I proposed it, the temp table may not even be
> necessary. The important part is the LEFT JOIN instead of the NOT IN
> (as Martijn has explained).
> You could try the direct approach ...
>
> DELETE FROM rid
> USING ( SELECT r.dokumnr
> FROM rid r
> LEFT JOIN dok d USING (dokumnr)
> WHERE d.dokumnr IS NULL) x
> WHERE rid.dokumnr = x.dokumnr;
> ... and see which runs faster. Probably it does not make much of a
> difference.


Thank you.
I changed my DELETE commands to use the temp table. This works fast.

I tried to change my UPDATE commands to use a temp table as well.
However, this causes the update command to run for 310 minutes:

UPDATE bilkaib SET cr4objekt = NULL
FROM ( SELECT r.cr4objekt AS key
       FROM bilkaib r
       LEFT JOIN yksus4 d ON d.YKSUS = r.cr4objekt
       WHERE d.YKSUS IS NULL) mydel
WHERE cr4objekt IS NOT NULL
  AND bilkaib.cr4objekt = mydel.key;

No idea why this is not fast like the DELETE command. The cr4objekt type is
CHAR(10); maybe that makes the temp table slow.

So I changed my script to

DROP TABLE IF EXISTS mydel;

CREATE TEMP TABLE mydel AS
  SELECT r.<<cchildkey>> AS key
  FROM <<m.cChildtable>> r
  LEFT JOIN <<cmaintable>> d ON d.<<mainkey>> = r.<<cchildkey>>
  WHERE d.<<mainkey>> IS NULL;

UPDATE <<m.cChildtable>> SET <<cchildkey>> = NULL
FROM mydel
WHERE <<cchildkey>> IS NOT NULL
  AND <<m.cChildtable>>.<<cchildkey>> = mydel.key;

Hope this will run fast (will test tomorrow).

My original script

UPDATE <<m.cChildtable>> SET <<cchildkey>>=NULL
        WHERE <<cchildkey>> IS NOT NULL AND
        <<cchildkey>> NOT IN (SELECT <<mainkey>> FROM <<cmaintable>>);

runs 27 minutes in some cases.
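Filled in with concrete (made-up) table names in place of the <<...>>
placeholders, the templated script behaves like this. Sketched in sqlite3
for easy testing; since older SQLite lacks UPDATE ... FROM, the temp-table
join is expressed with IN here, which PostgreSQL also accepts:

```python
import sqlite3

# Hypothetical main/child tables standing in for <<cmaintable>> and
# <<m.cChildtable>>. The temp table collects the orphaned keys once,
# so the UPDATE only has to match against that small set.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE main  (mainkey TEXT PRIMARY KEY);
    CREATE TABLE child (id INTEGER PRIMARY KEY, childkey TEXT);
    INSERT INTO main  VALUES ('A'), ('B');
    INSERT INTO child VALUES (1, 'A'), (2, 'B'), (3, 'X');  -- 'X' is orphaned

    DROP TABLE IF EXISTS mydel;
    CREATE TEMP TABLE mydel AS
        SELECT r.childkey AS key
        FROM child r
        LEFT JOIN main d ON d.mainkey = r.childkey
        WHERE d.mainkey IS NULL;

    UPDATE child SET childkey = NULL
    WHERE childkey IS NOT NULL
      AND childkey IN (SELECT key FROM mydel);

    DROP TABLE mydel;
""")
rows = con.execute("SELECT id, childkey FROM child ORDER BY id").fetchall()
print(rows)  # [(1, 'A'), (2, 'B'), (3, None)] -- only the orphan was nulled
```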


> If the temp table works for you, you might be interested in a new
> feature of 8.2: CREATE TEMP TABLE AS ...ON COMMIT DROP;
> http://www.postgresql.org/docs/current/static/sql-createtableas.html

Per Tom's remark, I removed transactions. Now every statement runs in a
separate transaction.
In this case ON COMMIT DROP is useless.
ON COMMIT DROP exists in 8.1 also.
8.2 adds DROP TABLE IF EXISTS.

>> explain analyze  SELECT r.dokumnr
>>  FROM rid r
>>  LEFT JOIN dok d USING (dokumnr)
>>  WHERE d.dokumnr IS NULL
>>
>> returns
>>
>> "Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual
>> time=2520.904..2520.904 rows=0 loops=1)"
>> "  Hash Cond: (r.dokumnr = d.dokumnr)"
>> "  Filter: (d.dokumnr IS NULL)"
>> "  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4)
>> (actual
>> time=0.032..352.225 rows=202421 loops=1)"
>> "  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual
>> time=211.150..211.150 rows=56079 loops=1)"
>> "        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4)
>> (actual time=0.021..147.805 rows=56079 loops=1)"
>> "Total runtime: 2521.091 ms"
>
> If the indices are present (and visible) at the time of execution, as
> you described it, we should be seeing index scans on dok_dokumnr_idx
> and rid_dokumnr_idx instead of sequential scans.
>
> That's what I get on a similar query in one of my databases:
> EXPLAIN ANALYZE SELECT a.adr_id FROM cp.adr a LEFT JOIN cp.kontakt k
> USING (adr_id) WHERE k.adr_id IS NULL;
>
> Merge Left Join  (cost=0.00..1356.31 rows=10261 width=4) (actual
> time=0.096..56.759 rows=3868 loops=1)
>  Merge Cond: ("outer".adr_id = "inner".adr_id)
>  Filter: ("inner".adr_id IS NULL)
>  ->  Index Scan using adr_pkey on adr a  (cost=0.00..947.54
> rows=10261 width=4) (actual time=0.012..23.118 rows=10261 loops=1)
>  ->  Index Scan using kontakt_adr_id_idx on kontakt k
> (cost=0.00..295.47 rows=7011 width=4) (actual time=0.007..13.299
> rows=7011 loops=1)
> Total runtime: 58.510 ms

I have no idea why my query plan shows a hash join and yours shows a merge
join. My primary key (dokumnr) is of type integer; maybe that selects the
hash plan.

For my big database I got the following plan:

 explain analyze  SELECT r.dokumnr
 FROM rid r
 LEFT JOIN dok d USING (dokumnr)
 WHERE d.dokumnr IS NULL

Hash Left Join  (cost=7759.44..31738.44 rows=1 width=4) (actual time=112.572..761.121 rows=3 loops=1)
  Hash Cond: (r.dokumnr = d.dokumnr)
  Filter: (d.dokumnr IS NULL)
  ->  Seq Scan on rid r  (cost=0.00..17424.64 rows=202464 width=4) (actual time=0.007..175.538 rows=202424 loops=1)
  ->  Hash  (cost=6784.64..6784.64 rows=56064 width=4) (actual time=111.296..111.296 rows=56079 loops=1)
        ->  Seq Scan on dok d  (cost=0.00..6784.64 rows=56064 width=4) (actual time=0.005..58.686 rows=56079 loops=1)
Total runtime: 761.311 ms

Since there are a lot of rows (202424 shown), this select command must be
using indexes.
Without indexes it would not be possible to obtain a speed of 0.7 seconds.

Andrus.