Thread: how to speed up query
How to speed up the query

delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

which runs approx 30 minutes? I have dokumnr indexes on both tables, and both tables are analyzed.

CREATE TABLE firma1.dok
(
  doktyyp character(1) NOT NULL,
  dokumnr integer NOT NULL DEFAULT nextval('dok_dokumnr_seq'::regclass),
  ....

CREATE INDEX dok_dokumnr_idx ON firma1.dok USING btree (dokumnr);

CREATE TABLE firma1.rid
(
  id integer NOT NULL DEFAULT nextval('rid_id_seq'::regclass),
  reanr integer NOT NULL DEFAULT nextval('rid_reanr_seq'::regclass),
  dokumnr integer NOT NULL,
  ....
  CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr)
      REFERENCES firma1.dok (dokumnr) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
  ..
)

CREATE INDEX rid_dokumnr_idx ON firma1.rid USING btree (dokumnr);

explain delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

returns

"Seq Scan on rid  (cost=7703.59..99687857.75 rows=102358 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4)"

Andrus.
On Fri, Jun 08, 2007 at 11:29:12AM +0300, Andrus wrote:
> How to speed up the query

We don't know. You don't tell us what version you're running, show us any EXPLAIN ANALYSE output, tell us about the data. . .

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful than reformatting your hard drive when it gets out of whack. --Scott Morris
Hi Andrus!

On Jun 8, 10:29 am, "Andrus" <kobrule...@hot.ee> wrote:
> How to speed up the query
>
> delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)
>
> CREATE TABLE firma1.dok
> (
>   doktyyp character(1) NOT NULL,
>   dokumnr integer NOT NULL DEFAULT nextval('dok_dokumnr_seq'::regclass),
>   ....
>
> CREATE INDEX dok_dokumnr_idx ON firma1.dok USING btree (dokumnr);
>
> CREATE TABLE firma1.rid
> (
>   id integer NOT NULL DEFAULT nextval('rid_id_seq'::regclass),
>   reanr integer NOT NULL DEFAULT nextval('rid_reanr_seq'::regclass),
>   dokumnr integer NOT NULL,
>   ....
>   CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr)
>       REFERENCES firma1.dok (dokumnr) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
(...)

This whole operation looks contradictory in several ways.

firma1.rid references firma1.dok on (dokumnr). Therefore, referential integrity commands that there be NO rows in firma1.rid with a dokumnr not present in firma1.dok. Therefore your DELETE cannot possibly be deleting anything. It is nonsensical:

delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

Did you mean:

delete from firma1.dok where dokumnr not in (select dokumnr from firma1.rid)

??

The next weird thing: I see no UNIQUE index (or primary key) ON firma1.dok.dokumnr. As it is being referenced by a foreign key constraint from firma1.rid, the system would require that. This index makes no sense at all:

CREATE INDEX dok_dokumnr_idx ON firma1.dok USING btree (dokumnr);

Either your problem description is messed up or your postgres installation is. My money is on the former.

Aside from that, my ideas would be (assuming that you got the statement backwards):

1.) Try to make the foreign key DEFERRABLE INITIALLY DEFERRED (instead of INITIALLY IMMEDIATE), because every delete on firma1.dok CASCADES to firma1.rid.

2.) Add a DISTINCT clause:

delete from firma1.dok where dokumnr not in (select DISTINCT dokumnr from firma1.rid)

3.) Write the results of the subquery to a temp table, then DELETE:

CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr FROM firma1.rid;
DELETE FROM firma1.dok USING mydel WHERE firma1.rid.doumnr = mydel.doumnr;

If these things do not solve your problem, it might still be helpful to tell us what they do.

Regards
Erwin
On Jun 9, 12:15 am, Erwin Brandstetter <brsaw...@gmail.com> wrote:
> 3.) Write results of the subquery in a temp table, then DELETE:
>
> CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr FROM firma1.rid;
> DELETE FROM firma1.dok USING mydel WHERE firma1.rid.doumnr =
> mydel.doumnr;

Ah! 3.) should read:

CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr FROM firma1.rid;
DELETE FROM firma1.dok WHERE dokumnr NOT IN (SELECT dokumnr FROM mydel);

Or 4.) If "NOT IN" should be the culprit, there is an alternative: (I seem to remember issues with its performance in the past, but hasn't that been improved? Not sure.) Haven't tested whether the temp table is useful here:

CREATE TEMP TABLE mydel AS
SELECT d.dokumnr
FROM firma1.dok d
LEFT JOIN (SELECT DISTINCT dokumnr FROM firma1.rid) r USING (dokumnr)
WHERE r.dokumnr IS NULL;
DELETE FROM firma1.dok USING mydel WHERE firma1.dok.dokumnr = mydel.dokumnr;

Regards
Erwin
On Jun 11, 2:01 pm, "Andrus" <kobrule...@hot.ee> wrote:
(...)
> > This index makes no sense at all:
> > CREATE INDEX dok_dokumnr_idx ON firma1.dok USING btree (dokumnr);
>
> I listed the table structure and constraints partially.
> There is also a primary key constraint in the dok table:
>
> CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),

On a sidenote: this primary key implements a unique index anyway. The additional index is useless. You can delete it to save time and storage. (Or maybe this is just another discrepancy between reality and problem description.)

Regards
Erwin
On Jun 11, 2:23 pm, "Andrus" <kobrule...@hot.ee> wrote:
> I tried
>
> CREATE TEMP TABLE mydel AS
> SELECT r.dokumnr
> FROM rid r
> LEFT JOIN dok d USING (dokumnr)
> WHERE d.dokumnr IS NULL;
> DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
> drop table mydel;
>
> and this runs in 1 second instead of 2.2 hours.
>
> Thank you very much.
> This works!
>
> It's sad that PostgreSQL cannot optimize this delete statement
> automatically.

1 second does sound a lot better than 2 hours, doesn't it? :)

As to why Postgres seems to fail, I cannot say any more, as your description is unclear. I am pretty sure there is some misunderstanding, though.

Regards
Erwin
Hi Andrus!

On Jun 12, 6:38 pm, "Andrus" <kobrule...@hot.ee> wrote:
> 1 second is for repeated runs from pgAdmin.
> In my script the same CREATE TEMP TABLE command takes approx 11 minutes for
> the same data (see log below).

I cannot make much sense of this information. I can see no reason why your script should take 11 minutes, while executing it from pgAdmin would take only a second. How do you run the script?

> After your suggested change my database creation script runs 6 hours.

Is that down from the 14 hours you mentioned before? Which would be an amazing 8 hours faster?

(...)
> I used query
>
> SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
> FROM pg_class
> where relpages * 8/1024>0
> ORDER BY relpages DESC

Looks like a useful query. Compare with:

SELECT pg_size_pretty(pg_database_size('bilkaib'))
SELECT pg_size_pretty(pg_relation_size('bilkaib'))
SELECT pg_size_pretty(pg_total_relation_size('bilkaib'))

See http://www.postgresql.org/docs/8.2/interactive/functions-admin.html

> Biggest table (bilkaib) load time is 8 minutes, it contains 329000
> records.
> Total data loading time is approx 49 minutes.

You mean table, not database?

> Remaining 5 hours are used for index and key creation. This seems too much.
>
> Here is the log file for minutes 49 .. 135, i.e. the first 86 minutes after
> loading data.
>
> It shows statements which ran more than 1 minute.
>
> The first number (49,4500) is the number of minutes from the start of the
> script (starting from database creation).
>
> The slowest statement is
>
> CREATE TEMP TABLE mydel AS
> SELECT r.dokumnr
> FROM rid r
> LEFT JOIN dok d USING (dokumnr)
> WHERE d.dokumnr IS NULL

It might be worth checking the order in which you create objects. Creating relevant indices before using complex queries is one thing to look for. If that still runs so slow, it's probably an indication that your RDBMS is in dire need of more RAM. Look to your setup in postgresql.conf.
As everything runs slow, you should look to your hardware, system configuration and PostgreSQL setup. Do you have enough RAM (you mentioned 2 GB) and does PostgreSQL get its share? (-> setup in postgresql.conf). There is probably a bottleneck somewhere.

I have a machine just like the one you described above (dual AMD Opteron 240, 2 GB RAM, 10k rpm HDDs in RAID 1) - slow CPUs and a conservative RAID setup - and a complete dump AND restore of a database where

SELECT pg_size_pretty(pg_database_size('event')) = 300 MB

takes about 1,5 minutes. And this is with pg 8.1.8, so your setup should have better hardware (guessing here) and newer software. But your DB is also a lot bigger.

Anyway, I am not a postgres hacker, I am just a DB admin myself, so don't expect too much from me. Someone else might know more. But if you want help, you'd better learn how to present a problem in a way that deserves help.

My best guess: buy at least 2 GB more RAM. Look to your settings in postgresql.conf. Read up here:

http://www.postgresql.org/docs/current/static/performance-tips.html
http://revsys.com/writings/postgresql-performance.html
http://www.powerpostgresql.com/Docs
http://www.powerpostgresql.com/Downloads/annotated_conf_80.html

If that does not solve your problem, post your setup or your script - whichever you suspect to be the problem - and try to present all the necessary information in a concise manner. That is much more likely to get help. Nobody wants to waste time, especially not helping someone free of charge. Your first posting was just not good enough. If you keep mixing things up, people will be frustrated and rather not help. Read your posting before you send it.

Regards
Erwin
Andrew,

>> How to speed up the query
>
> We don't know.

Thank you.

Explain seems to show that PostgreSQL makes a sequential scan of the whole dok table for every rid table row. This is very slow since dok contains 55963 and rid 202421 rows. I expected that there exists some trick, like forcing this DELETE command to use bitmaps by re-writing it using joins.

> You don't tell us what version you're running

"PostgreSQL 8.2.3 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mingw-special)"

> , show us any EXPLAIN ANALYSE output

I tried

explain analyze delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

but after 40 minutes this command is not finished. I will leave this query running for the weekend.

> , tell us about the data. . .

I'm creating a PostgreSQL database and loading data into it from my application. I use this type of delete command to remove bad rows before adding foreign keys. My loading application has been running for 14 hours and is not finished. Server and application are on the same fast computer with 2 GB RAM and a 10000 RPM WD Raptor HDD running Windows Vista. The postgresql.conf file is not modified; it is from the PostgreSQL Windows installer.

After 14 hours CPU usage is 50% (the postgres process uses it all), no disk activity. It seems that PostgreSQL cannot use 100% of the CPU on dual core processors when creating a new database and loading data using a single connection. It can use only 50% of the CPU. Maybe two connections, which can run two child processes, could use 100% of the CPU, but I have no generic idea how to split database loading and foreign key creation into two connections.

Andrus.
> This whole operation looks contradictory in several ways.
>
> firma1.rid references firma1.dok on (dokumnr)
> Therefore, referential integrity commands that there be NO rows in
> firma1.rid with a dokumnr not present in firma1.dok.
> Therefore your DELETE cannot possibly be deleting anything. It is
> nonsensical:
> delete from firma1.rid where dokumnr not in (select dokumnr from
> firma1.dok)

Yes, it is nonsensical. However, this command should run fast even if it is nonsensical. In my application I add the foreign key after running this delete command. I displayed the table structure after adding it, I'm sorry.

I tried the following commands

alter table firma1.rid drop constraint rid_dokumnr_fkey;
set constraints all deferred;
explain analyze delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

but it still produces the plan

"Seq Scan on rid  (cost=7703.59..98570208.00 rows=101210 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4)"

> Did you mean:
> delete from firma1.dok where dokumnr not in (select dokumnr from
> firma1.rid)
> ??

No. I mean

delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

> The next weird thing:
> I see no UNIQUE index (or primary key) ON firma1.dok.dokumnr. As it is
> being referenced by foreign key constraint from firma1.rid, the system
> would require that.
> This index makes no sense at all:
> CREATE INDEX dok_dokumnr_idx ON firma1.dok USING btree
> (dokumnr);

I listed the table structure and constraints partially. There is also a primary key constraint in the dok table:

CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),

> Either your problem description is messed up or your postgres
> installation is. My money is on the former.
>
> Aside from that, my ideas would be (assuming that you got the
> statement backwards):
> 1.) Try to make the foreign key DEFERRABLE INITIALLY DEFERRED (instead
> of INITIALLY IMMEDIATE), because every delete on firma1.dok CASCADES
> to firma1.rid.

I delete from the firma1.rid table. I dropped the foreign key using

alter table firma1.rid drop constraint rid_dokumnr_fkey;

but the problem persists.

> 2.) Add a DISTINCT clause:
> delete from firma1.dok where dokumnr not in (select DISTINCT
> dokumnr from firma1.rid)

I tried

delete from firma1.rid where dokumnr not in (select DISTINCT dokumnr from firma1.dok)

but this still runs a very long time. Output from explain:

"Seq Scan on rid  (cost=20569.69..98583074.10 rows=101210 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=20569.69..21403.32 rows=55963 width=4)"
"          ->  Unique  (cost=0.00..20239.73 rows=55963 width=4)"
"                ->  Index Scan using dok_dokumnr_idx on dok  (cost=0.00..20099.82 rows=55963 width=4)"

Andrus.
>> I tried
>>
>> CREATE TEMP TABLE mydel AS
>> SELECT r.dokumnr
>> FROM rid r
>> LEFT JOIN dok d USING (dokumnr)
>> WHERE d.dokumnr IS NULL;
>> DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
>> drop table mydel;
>>
>> and this runs in 1 second instead of 2.2 hours.
>>
>> Thank you very much.
>> This works!
>>
>> It's sad that PostgreSQL cannot optimize this delete statement
>> automatically.
>
> 1 second does sound a lot better than 2 hours, doesn't it? :)

1 second is for repeated runs from pgAdmin. In my script the same CREATE TEMP TABLE command takes approx 11 minutes for the same data (see log below).

> As to why Postgres seems to fail, I cannot say any more, as your
> description is unclear. I am pretty sure there is some
> misunderstanding, though.

After your suggested change my database creation script runs 6 hours.

The resulting database's biggest 15 tables are:

1 bilkaib 152MB
2 omrid 146MB
3 klient 130MB
4 rid 120MB
5 omdok 59MB
6 dok 48MB
7 mailbox 28MB
8 report 19MB
9 bilkaib_db_idx 16MB
10 bilkaib_cr_idx 16MB
11 bilkaib_pkey 14MB
12 bilkaib_kuupaev_idx 13MB
13 bilkaib_dokumnr_idx 11MB
14 summav 9MB
15 desktop 7MB

I used the query

SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
FROM pg_class
where relpages * 8/1024>0
ORDER BY relpages DESC

for this.

Biggest table (bilkaib) load time is 8 minutes; it contains 329000 records. Total data loading time is approx 49 minutes. The remaining 5 hours are used for index and key creation. This seems too much.

Here is the log file for minutes 49 .. 135, i.e. the first 86 minutes after loading data. It shows statements which ran more than 1 minute. The first number (49,4500) is the number of minutes from the start of the script (starting from database creation).

The slowest statement is

CREATE TEMP TABLE mydel AS
SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL

which starts at minute 104 and has a duration of 11 minutes. There seems to be no single bottleneck: there are many commands with durations of 1 .. 11 minutes.

I will run my script tonight to get complete timing.

Any idea how to increase speed?

Andrus.

49,4500 Duration 1,4167 minutes: UPDATE dok SET krdokumnr=NULL WHERE krDokumnr is NOT null AND doktyyp NOT IN ('G','O')
52,3167 Duration 2,8667 minutes: UPDATE dok SET krdokumnr=NULL WHERE doktyyp='G' AND krdokumnr IS NOT NULL and krdokumnr NOT in (select dokumnr from dok WHERE doktyyp='G')
55,1500 Duration 2,8333 minutes: UPDATE dok SET krdokumnr=NULL WHERE doktyyp='O' AND krdokumnr IS NOT NULL and krdokumnr NOT in (select dokumnr from dok WHERE doktyyp='O')
56,5667 Duration 1,4167 minutes: ALTER TABLE dok ADD CHECK (krdokumnr IS NULL OR doktyyp IN('G','O'))
57,9833 Duration 1,4167 minutes: ALTER TABLE dok ADD CHECK (dokumnr>0)
60,8333 Duration 2,8333 minutes: ALTER TABLE dok ADD FOREIGN KEY (krdokumnr) REFERENCES dok ON DELETE SET NULL ON UPDATE CASCADE DEFERRABLE
62,2667 Duration 1,4333 minutes: ALTER TABLE dok ALTER doktyyp SET NOT NULL
63,6833 Duration 1,4167 minutes: ALTER TABLE dok ALTER kuupaev SET NOT NULL
65,1333 Duration 1,4500 minutes: CREATE INDEX dok_kuupaev_idx ON dok (kuupaev)
66,5667 Duration 1,4333 minutes: CREATE INDEX dok_krdokumnr_idx ON dok (krdokumnr)
68 Duration 1,4333 minutes: CREATE INDEX dok_tellimus_idx ON dok (tellimus)
69,4333 Duration 1,4333 minutes: CREATE INDEX dok_tasudok_idx ON dok (tasudok)
70,8833 Duration 1,4333 minutes: CREATE INDEX dok_klient_idx ON dok (klient)
72,3167 Duration 1,4333 minutes: CREATE INDEX dok_tasumata_idx ON dok (tasumata)
73,7500 Duration 1,4333 minutes: CREATE UNIQUE INDEX dok_tasudok_unique_idx ON dok (doktyyp,tasudok) WHERE doktyyp IN ('T', 'U')
83,5000 Duration 9,7500 minutes: CREATE INDEX rid_dokumnr_idx ON rid (dokumnr)
93,2500 Duration 9,7500 minutes: CREATE INDEX rid_toode_idx ON rid (toode)
104,3500 Duration 11,1000 minutes: CREATE TEMP TABLE mydel AS SELECT r.dokumnr FROM rid r LEFT JOIN dok d USING (dokumnr) WHERE d.dokumnr IS NULL
114,0167 Duration 9,6500 minutes: DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr
125,1500 Duration 11,1333 minutes: ALTER TABLE rid ADD FOREIGN KEY (dokumnr) REFERENCES dok ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE
135,0833 Duration 9,9333 minutes: ALTER TABLE rid ALTER dokumnr SET NOT NULL
>> There is also a primary key constraint in the dok table:
>>
>> CONSTRAINT dok_pkey PRIMARY KEY (dokumnr),
>
> On a sidenote: this primary key implements a unique index anyway. The
> additional index is useless. You can delete it to save time and
> storage.
> (Or maybe this is just another discrepancy between reality and problem
> description.)

Erwin, thank you.

This index was duplicated. I changed my script not to create this index. However, this should not affect speed much, since this index is created after the data is loaded into the table.

Andrus.
> Ah! 3.) should read:
> CREATE TEMP TABLE mydel AS SELECT DISTINCT dokumnr FROM firma1.rid;
> DELETE FROM firma1.dok WHERE dokumnr NOT IN (SELECT dokumnr FROM
> mydel);

I need to delete from the firma1.rid table, so I cannot use this suggestion, since firma1.dok.dokumnr is already unique (primary key).

> Or 4.)
> If "NOT IN" should be the culprit, there is an alternative:
> ( I seem to remember issues with its performance in the past, but
> hasn't that been improved? Not sure.)
> Haven't tested, whether the temp table is useful here:
>
> CREATE TEMP TABLE mydel AS
> SELECT d.dokumnr
> FROM firma1.dok d
> LEFT JOIN (SELECT DISTINCT dokumnr FROM firma1.rid) r USING (dokumnr)
> WHERE r.dokumnr IS NULL;
> DELETE FROM firma1.dok USING mydel WHERE firma1.dok.dokumnr =
> mydel.dokumnr;

I tried

CREATE TEMP TABLE mydel AS
SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL;
DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
drop table mydel;

and this runs in 1 second instead of 2.2 hours.

Thank you very much.
This works!

It's sad that PostgreSQL cannot optimize this delete statement automatically.

Andrus.
> We don't know. You don't tell us what version you're running, show
> us any EXPLAIN ANALYSE output, tell us about the data. . .

explain analyze delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok)

produces

"Seq Scan on rid  (cost=7703.59..98570208.00 rows=101210 width=6) (actual time=7269933.877..7269933.877 rows=0 loops=1)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=7703.59..8537.22 rows=55963 width=4) (actual time=0.007..18.707 rows=25313 loops=202421)"
"          ->  Seq Scan on dok  (cost=0.00..7373.63 rows=55963 width=4) (actual time=40.236..3353.985 rows=56079 loops=1)"
"Total runtime: 7269944.251 ms"

As I understand it, this took 2.2 hours to run.

Andrus.
On Mon, Jun 11, 2007 at 03:01:08PM +0300, Andrus wrote:
> > delete from firma1.rid where dokumnr not in (select dokumnr from
> > firma1.dok)
>
> Yes, it is nonsensical. However, this command should run fast even if it is
> nonsensical.

For future reference, I believe the problem is the NOT IN. It has this "feature" where if any of the rows it searches has a NULL, it will return FALSE for *all* rows. So the whole table has to be scanned to check that there aren't any NULLs, before a single row can be returned. This is why it can't be converted to a join.

Now, you may argue that in your case this doesn't apply, which may be true, but it's always been a difficult construct to optimise... (and somewhat surprising for people who didn't realise the null effect).

The most efficient way to write this is with an OUTER JOIN.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
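[Editorial note: the NOT IN / NULL behaviour Martijn describes is easy to reproduce. A minimal sketch follows, using SQLite via Python purely for convenience; the SQL-standard three-valued logic behaves the same way in PostgreSQL. Table names follow the thread, the data is invented.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dok (dokumnr INTEGER);
    CREATE TABLE rid (dokumnr INTEGER);
    INSERT INTO dok VALUES (1), (2);
    INSERT INTO rid VALUES (1), (2), (99);  -- 99 is an orphan
""")

query = ("SELECT dokumnr FROM rid "
         "WHERE dokumnr NOT IN (SELECT dokumnr FROM dok)")

# Without NULLs in dok.dokumnr the orphan row is found as expected.
rows_before = con.execute(query).fetchall()
print(rows_before)  # [(99,)]

# One NULL in dok.dokumnr makes "x NOT IN (...)" evaluate to UNKNOWN
# for every x, so the very same query now silently returns nothing.
con.execute("INSERT INTO dok VALUES (NULL)")
rows_after = con.execute(query).fetchall()
print(rows_after)  # []
```

This is why the planner cannot blindly turn NOT IN into an anti-join: a single NULL in the subquery changes the result of the whole statement.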
> > delete from firma1.rid where dokumnr not in (select dokumnr from
> > firma1.dok)
>
> For future reference, I believe the problem is the NOT IN. It has this
> "feature" where if any of the rows it searches has a NULL, it will
> return FALSE for *all* rows. So the whole table has to be scanned to
> check that there aren't any NULLs, before a single row can be returned.
> This is why it can't be converted to a join.

Thank you.

As I understand it, the only way to optimize the statement

delete from firma1.rid where dokumnr not in (select dokumnr from firma1.dok);

assuming that firma1.dok.dokumnr does not contain null values, is to change it to

CREATE TEMP TABLE mydel AS
SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL;
DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
drop table mydel;

I run the following commands in my script (the first number is minutes from script start):

18 Duration 2,9 minutes: ALTER TABLE dok ADD PRIMARY KEY (dokumnr)
...
81 Duration 9,6 minutes: CREATE INDEX rid_dokumnr_idx ON rid (dokumnr)
...
101 Duration 10,5 minutes: analyze
...
113 Duration 11 minutes: CREATE TEMP TABLE mydel AS SELECT r.dokumnr FROM rid r LEFT JOIN dok d USING (dokumnr) WHERE d.dokumnr IS NULL
122 Duration 9,6 minutes: DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr
133 Duration 11 minutes: ALTER TABLE rid ADD FOREIGN KEY (dokumnr) REFERENCES dok ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE

When I run the command

CREATE TEMP TABLE mydel AS
SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL

from pgAdmin, it takes 1 second. When I run this command from the script, it takes 11 minutes!

Any idea why running this command from the script takes 11 minutes? I have created indexes on dok and rid and ran analyze before using CREATE TEMP TABLE, so I expect the CREATE TEMP TABLE command to take the same time to run from the script as from pgAdmin.

My script is running in a single transaction. Should I use commit after index creation or after the ANALYZE command?

In pgAdmin

explain analyze SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL

returns

"Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual time=2520.904..2520.904 rows=0 loops=1)"
"  Hash Cond: (r.dokumnr = d.dokumnr)"
"  Filter: (d.dokumnr IS NULL)"
"  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4) (actual time=0.032..352.225 rows=202421 loops=1)"
"  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual time=211.150..211.150 rows=56079 loops=1)"
"        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4) (actual time=0.021..147.805 rows=56079 loops=1)"
"Total runtime: 2521.091 ms"

Andrus.
"Andrus" <kobruleht2@hot.ee> writes: > from pgAdmin, it takes 1 second. > When I run this command from script it takes 11 minutes! > Any idea why running this command from script takes 11 minutes? Different plans maybe? Try EXPLAIN ANALYZE in both cases. Do you have work_mem set the same in both cases? > My script in running in single transaction. > Should I use commit after index creation or after ANALYZE command? Hmm, there are some extra cycles involved in examining not-yet-committed tuples, but I hardly see how that would create a discrepancy of this size. Check the plans first. regards, tom lane
> I cannot make much sense of this information. I can see no reason why
> your script should take 11 minutes, while executing it from pgAdmin
> would take only a second. How do you run the script?

I'm running my script from a VFP client application. The application sends every statement to the server separately using the ODBC driver. Table creation, data loading, primary key creation, index creation, analyze and the problematic CREATE TEMP TABLE command all run in a single transaction.

Should I commit the transaction after the analyze command or after index creation?

The server log shows:

2007-06-13 03:19:43 LOG: checkpoints are occurring too frequently (21 seconds apart)
2007-06-13 03:19:43 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2007-06-13 03:20:02 LOG: checkpoints are occurring too frequently (19 seconds apart)
2007-06-13 03:20:02 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2007-06-13 03:20:22 LOG: checkpoints are occurring too frequently (20 seconds apart)
2007-06-13 03:20:22 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2007-06-13 03:21:18 LOG: checkpoints are occurring too frequently (23 seconds apart)
2007-06-13 03:21:18 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2007-06-13 03:49:10 ERROR: deadlock detected
2007-06-13 03:49:10 DETAIL: Process 3280 waits for AccessExclusiveLock on relation 233893 of database 233756; blocked by process 2508. Process 2508 waits for ShareUpdateExclusiveLock on relation 233988 of database 233756; blocked by process 3280.
2007-06-13 03:49:10 STATEMENT: ALTER TABLE desktop ADD FOREIGN KEY (alamklass) REFERENCES andmetp ON UPDATE CASCADE DEFERRABLE

and the script terminated after running 5.5 hours yesterday night. I will re-start the computer and try again.

Can increasing checkpoint_segments increase speed significantly?

>> After your suggested change my database creation script runs 6 hours.
>
> Is that down from the 14 hours you mentioned before? Which would be an
> amazing 8 hours faster?

I had a number of DELETE .. WHERE NOT IN commands. I changed all of them to CREATE TEMP TABLE ... DELETE.

>> I used query
>>
>> SELECT relname as Table_Name, relpages * 8/1024 as size_in_mb
>> FROM pg_class
>> where relpages * 8/1024>0
>> ORDER BY relpages DESC
>
> Looks like a useful query. Compare with:
> SELECT pg_size_pretty(pg_database_size('bilkaib'))

SELECT pg_size_pretty(pg_database_size('mydb')) returns 828 MB

> SELECT pg_size_pretty(pg_relation_size('bilkaib'))

returns 100 MB

> SELECT pg_size_pretty(pg_total_relation_size('bilkaib'))

returns "171 MB"

relpages * 8/1024 and pg_relation_size(oid) return very different results in some cases, no idea why. For one index relpages gives a size of about 6 MB but pg_relation_size returns only 2152 kB.

>> Biggest table (bilkaib) load time is 8 minutes, it contains 329000
>> records.
>> Total data loading time is approx 49 minutes.
>
> You mean table, not database?

Yes, I meant table.

>> Remaining 5 hours are used for index and key creation. This seems too
>> much.
>
> It might be worth checking the order in which you create objects.
> Creating relevant indices before using complex queries is one thing to
> look for.

I created the primary key on dok(dokumnr), the index on rid(dokumnr) and ran analyze before running this CREATE TEMP TABLE command.

> If that still runs so slow it's probably indication that your RDBMS is
> in dire need of more RAM. Look to your setup in postgresql.conf.
> As everything runs slow, you should look to your hardware, system
> configuration and PostgreSQL setup. Do you have enough RAM (you
> mentioned 2 GB) and does PostgreSQL get its share? (-> setup in
> postgresql.conf). There is probably a bottleneck somewhere.

If CREATE TEMP TABLE from pgAdmin takes 1 sec and from the script 11 minutes, I don't think this is hardware related.

> If that does not solve your problem, post your setup or your script -
> whichever you suspect to be the problem

The script which creates the 800 MB database is big. I can create this script, but is anybody interested in looking into it?

Andrus.
"Andrus" <kobruleht2@hot.ee> writes: > 2007-06-13 03:49:10 ERROR: deadlock detected > 2007-06-13 03:49:10 DETAIL: Process 3280 waits for AccessExclusiveLock on > relation 233893 of database 233756; blocked by process 2508. > Process 2508 waits for ShareUpdateExclusiveLock on relation 233988 of > database 233756; blocked by process 3280. > 2007-06-13 03:49:10 STATEMENT: ALTER TABLE desktop ADD FOREIGN KEY > (alamklass) REFERENCES andmetp ON UPDATE CASCADE DEFERRABLE > and script terminates after 5.5 hours running yesterday night. This might be a good reason not to run the script as a single long transaction --- it's probably accumulating locks on a lot of different tables. Which would be fine if it was the only thing going on, but evidently it isn't. regards, tom lane
>> and script terminates after 5.5 hours running yesterday night.
>
> This might be a good reason not to run the script as a single long
> transaction --- it's probably accumulating locks on a lot of different
> tables. Which would be fine if it was the only thing going on, but
> evidently it isn't.

Thank you.

I removed the transaction from the script altogether. I send every statement separately to Postgres, so each statement now runs in its own single transaction. I hope this does not decrease speed and is the best solution?

Now

explain analyze SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL

returns the same time from the script and when run separately, on a small database.

Andrus.
>> from pgAdmin, it takes 1 second. >> When I run this command from script it takes 11 minutes! > >> Any idea why running this command from script takes 11 minutes? > > Different plans maybe? Try EXPLAIN ANALYZE in both cases. Thank you. I tried explain analyze SELECT r.dokumnr FROM rid r LEFT JOIN dok d USING (dokumnr) WHERE d.dokumnr IS NULL with small database. From script this command returns Hash Left Join (cost=12.11..60.42 rows=1 width=4) (actual time=105.473..105.473 rows=0 loops=1) Hash Cond: (r.dokumnr = d.dokumnr) Filter: (d.dokumnr IS NULL) -> Seq Scan on rid r (cost=0.00..38.87 rows=687 width=4) (actual time=2.144..90.823 rows=687 loops=1) -> Hash (cost=10.38..10.38 rows=138 width=4) (actual time=13.925..13.925 rows=138 loops=1) -> Seq Scan on dok d (cost=0.00..10.38 rows=138 width=4) (actual time=1.715..13.812 rows=138 loops=1) Total runtime: 105.542 ms running in standalone it returns Hash Left Join (cost=13.44..61.76 rows=1 width=4) (actual time=2.172..2.172 rows=0 loops=1) Hash Cond: (r.dokumnr = d.dokumnr) Filter: (d.dokumnr IS NULL) -> Seq Scan on rid r (cost=0.00..38.87 rows=687 width=4) (actual time=0.076..0.802 rows=687 loops=1) -> Hash (cost=11.53..11.53 rows=153 width=4) (actual time=0.400..0.400 rows=138 loops=1) -> Seq Scan on dok d (cost=0.00..11.53 rows=153 width=4) (actual time=0.013..0.242 rows=138 loops=1) Total runtime: 2.338 ms I have no idea why this command runs 50 times slower in script. ODBC driver inserts RELEASE SAVEPOINT and SAVEPOINT commands before every statement. There is great explanation about his in http://archives.postgresql.org/pgsql-odbc/2006-05/msg00078.php Unfortunately, no connection string option is documented. I havent found a way to disable this automatic SAVEPOINT insertion from odbc connection string. I havent got reply to my message from January, 18 2007 in odbc forum (I posted again today). Reading ODBC driver source this I expected that Protocol=-0 in connection string should work but this does not. 
Probably I missed something in the C source. However, I think this cannot
slow down the speed of a SELECT command.

> Do you have work_mem set the same in both cases?

Yes. I have the same database server and the same database.

Andrus.
On Jun 13, 3:13 pm, "Andrus" <kobrule...@hot.ee> wrote:
(...)
> As I understand, only way to optimize the statement
>
> delete from firma1.rid where dokumnr not in (select dokumnr from
> firma1.dok);
>
> assuming that firma1.dok.dokumnr does not contain null values is to change
> it to
>
> CREATE TEMP TABLE mydel AS
> SELECT r.dokumnr
> FROM rid r
> LEFT JOIN dok d USING (dokumnr)
> WHERE d.dokumnr IS NULL;
> DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
> DROP TABLE mydel;

As I mentioned when I proposed it, the temp table may not even be
necessary. The important part is the LEFT JOIN instead of the NOT IN (as
Martijn has explained). You could try the direct approach ...

DELETE FROM rid
USING ( SELECT r.dokumnr
        FROM rid r
        LEFT JOIN dok d USING (dokumnr)
        WHERE d.dokumnr IS NULL) x
WHERE rid.dokumnr = x.dokumnr;

... and see which runs faster. Probably it does not make much of a
difference.

If the temp table works for you, you might be interested in a new feature
of 8.2: CREATE TEMP TABLE AS ... ON COMMIT DROP;
http://www.postgresql.org/docs/current/static/sql-createtableas.html

(...)
> explain analyze SELECT r.dokumnr
> FROM rid r
> LEFT JOIN dok d USING (dokumnr)
> WHERE d.dokumnr IS NULL
>
> returns
>
> "Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual
> time=2520.904..2520.904 rows=0 loops=1)"
> "  Hash Cond: (r.dokumnr = d.dokumnr)"
> "  Filter: (d.dokumnr IS NULL)"
> "  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4)
> (actual time=0.032..352.225 rows=202421 loops=1)"
> "  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual
> time=211.150..211.150 rows=56079 loops=1)"
> "        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4)
> (actual time=0.021..147.805 rows=56079 loops=1)"
> "Total runtime: 2521.091 ms"

If the indices are present (and visible) at the time of execution, as you
described it, we should be seeing index scans on dok_dokumnr_idx and
rid_dokumnr_idx instead of sequential scans.
That's what I get on a similar query in one of my databases:

EXPLAIN ANALYZE SELECT a.adr_id
FROM cp.adr a
LEFT JOIN cp.kontakt k USING (adr_id)
WHERE k.adr_id IS NULL;

Merge Left Join  (cost=0.00..1356.31 rows=10261 width=4) (actual time=0.096..56.759 rows=3868 loops=1)
  Merge Cond: ("outer".adr_id = "inner".adr_id)
  Filter: ("inner".adr_id IS NULL)
  ->  Index Scan using adr_pkey on adr a  (cost=0.00..947.54 rows=10261 width=4) (actual time=0.012..23.118 rows=10261 loops=1)
  ->  Index Scan using kontakt_adr_id_idx on kontakt k  (cost=0.00..295.47 rows=7011 width=4) (actual time=0.007..13.299 rows=7011 loops=1)
Total runtime: 58.510 ms

Regards
Erwin
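The reason the LEFT JOIN rewrite above assumes no NULLs in the subquery column (the point Martijn explained earlier in the thread) can be sketched outside Postgres. This is a hedged illustration using Python's sqlite3 with made-up miniature tables; only the table and column names are borrowed from the thread:

```python
import sqlite3

# Hedged sketch: NOT IN vs. the LEFT JOIN / IS NULL anti-join.
# Table and column names come from the thread; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dok (dokumnr INTEGER);
    CREATE TABLE rid (dokumnr INTEGER);
    INSERT INTO dok VALUES (1), (2);
    INSERT INTO rid VALUES (1), (2), (3);  -- 3 is an orphan
""")

not_in = conn.execute(
    "SELECT dokumnr FROM rid"
    " WHERE dokumnr NOT IN (SELECT dokumnr FROM dok)").fetchall()
anti_join = conn.execute(
    "SELECT r.dokumnr FROM rid r"
    " LEFT JOIN dok d ON d.dokumnr = r.dokumnr"
    " WHERE d.dokumnr IS NULL").fetchall()
print(not_in, anti_join)  # [(3,)] [(3,)] -- identical while dok has no NULLs

# One NULL in dok changes everything: NOT IN returns no rows at all
# (three-valued logic makes "x NOT IN (..., NULL)" unknown), while the
# anti-join still finds the orphan.
conn.execute("INSERT INTO dok VALUES (NULL)")
not_in_with_null = conn.execute(
    "SELECT dokumnr FROM rid"
    " WHERE dokumnr NOT IN (SELECT dokumnr FROM dok)").fetchall()
print(not_in_with_null)  # []
```

Since firma1.dok.dokumnr is NOT NULL in the schema from the original post, the two forms are equivalent there, and the anti-join gives the planner more room.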
>> CREATE TEMP TABLE mydel AS
>> SELECT r.dokumnr
>> FROM rid r
>> LEFT JOIN dok d USING (dokumnr)
>> WHERE d.dokumnr IS NULL;
>> DELETE FROM rid USING mydel WHERE rid.dokumnr = mydel.dokumnr;
>> drop table mydel;
>
> As I mentioned when I proposed it, the temp table may not even be
> necessary. The important part is the LEFT JOIN instead of the NOT IN
> (as Martijn has explained).
> You could try the direct approach ...
>
> DELETE FROM rid
> USING ( SELECT r.dokumnr
>         FROM rid r
>         LEFT JOIN dok d USING (dokumnr)
>         WHERE d.dokumnr IS NULL) x
> WHERE rid.dokumnr = x.dokumnr;
>
> ... and see which runs faster. Probably it does not make much of a
> difference.

Thank you. I changed my DELETE commands to use a temp table. This works
fast.

I tried to change my UPDATE commands to use the same approach as well.
However, this causes the update command to run 310 minutes:

update bilkaib SET cr4objekt=NULL
FROM ( SELECT r.cr4objekt as key
       FROM bilkaib r
       LEFT JOIN yksus4 d ON d.YKSUS = r.cr4objekt
       WHERE d.YKSUS IS NULL) mydel
WHERE cr4objekt IS NOT NULL AND bilkaib.cr4objekt = mydel.key;

No idea why this does not run fast like the DELETE command. cr4objekt is of
type CHAR(10); maybe this makes the internal table slow.

So I changed my script to

DROP TABLE if exists mydel;
CREATE TEMP TABLE mydel AS
SELECT r.<<cchildkey>> as key
FROM <<m.cChildtable>> r
LEFT JOIN <<cmaintable>> d ON d.<<mainkey>> = r.<<cchildkey>>
WHERE d.<<mainkey>> IS NULL;
update <<m.cChildtable>> SET <<cchildkey>>=NULL
FROM mydel
WHERE <<cchildkey>> IS NOT NULL
  AND <<m.cChildtable>>.<<cchildkey>> = mydel.key;

Hope this will run fast (will test tomorrow).

My original script

UPDATE <<m.cChildtable>> SET <<cchildkey>>=NULL
WHERE <<cchildkey>> IS NOT NULL
  AND <<cchildkey>> NOT IN (SELECT <<mainkey>> FROM <<cmaintable>>);

runs 27 minutes in some cases.
> If the temp table works for you, you might be interested in a new
> feature of 8.2: CREATE TEMP TABLE AS ... ON COMMIT DROP;
> http://www.postgresql.org/docs/current/static/sql-createtableas.html

Per Tom's remark, I removed transactions. Now every statement runs in a
separate transaction, so ON COMMIT DROP is useless in this case. Also, ON
COMMIT DROP exists in 8.1 as well; 8.2 adds DROP IF EXISTS.

>> explain analyze SELECT r.dokumnr
>> FROM rid r
>> LEFT JOIN dok d USING (dokumnr)
>> WHERE d.dokumnr IS NULL
>>
>> returns
>>
>> "Hash Left Join  (cost=7760.27..31738.02 rows=1 width=4) (actual
>> time=2520.904..2520.904 rows=0 loops=1)"
>> "  Hash Cond: (r.dokumnr = d.dokumnr)"
>> "  Filter: (d.dokumnr IS NULL)"
>> "  ->  Seq Scan on rid r  (cost=0.00..17424.24 rows=202424 width=4)
>> (actual time=0.032..352.225 rows=202421 loops=1)"
>> "  ->  Hash  (cost=6785.01..6785.01 rows=56101 width=4) (actual
>> time=211.150..211.150 rows=56079 loops=1)"
>> "        ->  Seq Scan on dok d  (cost=0.00..6785.01 rows=56101 width=4)
>> (actual time=0.021..147.805 rows=56079 loops=1)"
>> "Total runtime: 2521.091 ms"
>
> If the indices are present (and visible) at the time of execution, as
> you described it, we should be seeing index scans on dok_dokumnr_idx
> and rid_dokumnr_idx instead of sequential scans.
> > That's what I get on a similar query in one of my databases:
> > EXPLAIN ANALYZE SELECT a.adr_id FROM cp.adr a LEFT JOIN cp.kontakt k
> > USING (adr_id) WHERE k.adr_id IS NULL;
> >
> > Merge Left Join  (cost=0.00..1356.31 rows=10261 width=4) (actual
> > time=0.096..56.759 rows=3868 loops=1)
> >   Merge Cond: ("outer".adr_id = "inner".adr_id)
> >   Filter: ("inner".adr_id IS NULL)
> >   ->  Index Scan using adr_pkey on adr a  (cost=0.00..947.54
> > rows=10261 width=4) (actual time=0.012..23.118 rows=10261 loops=1)
> >   ->  Index Scan using kontakt_adr_id_idx on kontakt k
> > (cost=0.00..295.47 rows=7011 width=4) (actual time=0.007..13.299
> > rows=7011 loops=1)
> > Total runtime: 58.510 ms

I have no idea why my query plan shows a hash join while your plan shows a
merge join. My primary key (dokumnr) is of type integer; maybe this selects
the hash plan.

For my big database I got the following plan:

explain analyze SELECT r.dokumnr
FROM rid r
LEFT JOIN dok d USING (dokumnr)
WHERE d.dokumnr IS NULL

Hash Left Join  (cost=7759.44..31738.44 rows=1 width=4) (actual time=112.572..761.121 rows=3 loops=1)
  Hash Cond: (r.dokumnr = d.dokumnr)
  Filter: (d.dokumnr IS NULL)
  ->  Seq Scan on rid r  (cost=0.00..17424.64 rows=202464 width=4) (actual time=0.007..175.538 rows=202424 loops=1)
  ->  Hash  (cost=6784.64..6784.64 rows=56064 width=4) (actual time=111.296..111.296 rows=56079 loops=1)
        ->  Seq Scan on dok d  (cost=0.00..6784.64 rows=56064 width=4) (actual time=0.005..58.686 rows=56079 loops=1)
Total runtime: 761.311 ms

Since there are a lot of rows (202424 shown), this select command must use
indexes. Without indexes it is not possible to obtain a speed of 0.7
seconds.

Andrus.
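The two anti-join strategies the plans in this thread show can be sketched in a few lines. This is a hedged illustration with made-up keys, not Postgres internals: the hash variant builds a hash table of dok keys and probes it once per rid row (what the Hash Left Join does), while the merge variant walks two sorted inputs in lockstep, which is what index scans over the btree indexes would feed into a Merge Left Join:

```python
# Hedged sketch of the two anti-join strategies, with invented keys.
def hash_anti_join(rid_keys, dok_keys):
    # Hash Left Join + "IS NULL" filter: build the hash once,
    # then probe it for every rid row.
    dok = set(dok_keys)
    return [k for k in rid_keys if k not in dok]

def merge_anti_join(rid_sorted, dok_sorted):
    # Merge Left Join over two sorted inputs (as btree index
    # scans would deliver them), keeping rows with no partner.
    out, i = [], 0
    for k in rid_sorted:
        while i < len(dok_sorted) and dok_sorted[i] < k:
            i += 1
        if i >= len(dok_sorted) or dok_sorted[i] != k:
            out.append(k)  # no matching dok key -> orphan
    return out

rid, dok = [1, 2, 3, 5, 7], [1, 3, 4, 5]
print(hash_anti_join(rid, dok))   # [2, 7]
print(merge_anti_join(rid, dok))  # [2, 7]
```

Either way the result is the same set of orphans; the planner chooses by estimated cost, and at these table sizes hashing the smaller table and streaming the bigger one through it can beat index scans even when the indexes exist.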