activating RI constraints too slow - Mailing list pgsql-bugs
From | Preston Landers |
---|---|
Subject | activating RI constraints too slow |
Date | |
Msg-id | Pine.LNX.4.33.0308281440140.6957-100000@dev2.int.journyx.com Whole thread Raw |
List | pgsql-bugs |
============================================================================ POSTGRESQL BUG REPORT TEMPLATE ============================================================================ Your name : Preston Landers Your email address : planders@journyx.com System Configuration --------------------- Architecture (example: Intel Pentium) : Intel Pentium II dual 500 mhz Operating System (example: Linux 2.0.26 ELF) : Linux 2.4.2-2smp (Redhat 7.1) PostgreSQL version (example: PostgreSQL-7.3): PostgreSQL-7.4beta2 snapshot (from 2003/8/26.) Compiler used (example: gcc 2.95.2) : GCC 2.96 Please enter a FULL description of your problem: ------------------------------------------------ This is not a "bug" report, just a performance complaint unaccompanied by a patch. ;-) Enabling referential integrity is painfully slow. Enabling a foreign key constraint appears to take time proportional to the number of parent and child records, but even longer than an equivelent "check for bad records" query would. With a good number of records (more than a few hundred thousand) this can take a painfully long time. All of the time appears to be taken in the query that checks for non-conforming records, not in creating the rules. I *do* have indexes on all relevent columns. The columns are the same data-type. It may be that I am simply doing something else wrong but I sure can't identify it. I would appreciate any suggestions. Oracle and SQL Server, at least, seem to handle these checks in a much more reasonable amount of time (less than 1/10th the time,) for whatever that's worth. We are currently using PG 7.2.4. I was hoping that upgrading to 7.4 would fix this, as I understand there are a number of performances fixes. Unfortunately, this does not seem to be addressed yet. (I'm testing using a snapshot from 2003/08/26.) Please describe a way to repeat the problem. Please try to provide a concise reproducible example, if at all possible: ---------------------------------------------------------------------- I'm pretty sure this has been brought up before on various mailing lists. At least when I search around, I can find people posing similar questions, but I can never seem to find a response or solution. Anyway, I can provide a sample database or SQL script that illustrates the problem, if neccesary. The trick is to actually have hundreds of thousands of records in there. Out of all my tables, the FK constraint that takes longest to activate is this one. 162.65 seconds for query: ALTER TABLE "time_recs_attribs" ADD CONSTRAINT f904_1_123 FOREIGN KEY ("id_time_rec") REFERENCES "time_recs" ("id_time_rec") ON DELETE CASCADE; time_recs (parent table) has 281,043 records in this particular database. time_recs_attribs (child) has 246,890 records. The column time_recs_attribs.id_time_rec (child) does have an index. The parent table automatically has an index on the primary key column (the one I am referencing.) I have tried doing a VACUUM FULL just prior to activating the constraint to no avail. In both tables the datatypes are VARCHAR(30). With the same data and same schema under SQL Server, this ALTER TABLE statement takes about 3 seconds on the same (actually, slightly worse) hardware. For another example, we have a constraint on the "time_recs" table referencing the primary key of the "projects" table. timesheet=# select count(id_project) from projects; count ------- 462 timesheet=# select count(id_time_rec) from time_recs; count -------- 101045 template1=# select count(distinct(id_project)) from time_recs; count ------- 285 timesheet=# explain analyze timesheet-# select id_time_rec from time_recs timesheet-# where not exists ( select id_project from projects where projects.id_project = time_recs.id_project ); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on time_recs (cost=0.00..591579.89 rows=50523 width=33) (actual time=9278.47..9278.47 rows=0 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using projects_pkey on projects (cost=0.00..5.82 rows=1 width=33) (actual time=0.07..0.07 rows=1 loops=101045) Index Cond: ((id_project)::text = ($0)::text) Total runtime: 9278.70 msec (6 rows) Time: 9282.38 ms template1=# ALTER TABLE "time_recs" ADD CONSTRAINT f123_5_108 FOREIGN KEY ("id_project") REFERENCES "projects" ("id_project"); NOTICE: ALTER TABLE will create implicit trigger(s) for FOREIGN KEY check(s) ALTER TABLE Time: 27137.22 ms Why does the ALTER TABLE statement take three times as long as the equivelent check query? When I enable debug logging on my pgres server, I see this: SELECT 1 FROM ONLY "public"."projects" x WHERE "id_project" = $1 FOR UPDATE OF x It seems to be doing this query behind the scenes, iterating over each record in time_recs. I guess that this might be the source of the inefficiency. I have no idea how to address this in the postgresql design -- just wanted to point it out in case it had been unnoticed. Or could this just be purely a configuration issue? Most of the PG config file settings in these databases are at their default values. Please let me know if you need an actual sample database to better illustrate this. By the way, Microsoft SQL Server at least seems to use index scans for both tables (parent and child), instead of a sequential scan over the child table, in this case time_recs. The equivelent "check query" in SQL Server also uses index scans for both tables, using the "Hash Match / Right Anti Semi Join" method to sort them together. (This is according to the Query Analyzer.) If you know how this problem might be fixed, list the solution below: --------------------------------------------------------------------- See above...
pgsql-bugs by date: