Performance issue in pg_dump's dependency loop searching - Mailing list pgsql-hackers

From Tom Lane
Subject Performance issue in pg_dump's dependency loop searching
Date
Msg-id 27628.1406317657@sss.pgh.pa.us
Responses Re: Performance issue in pg_dump's dependency loop searching  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
I looked into the performance issue Joe Van Dyk reported in bug #11033.
It turns out this is basically a consequence of the "section boundary"
pseudo-objects I added in commit a1ef01fe163b304760088e3e30eb22036910a495.
(I'd worried about performance consequences at the time, but hadn't
seen any actual evidence that there would be a problem.)

In Joe's test case, there are 8119 objects that need to be sorted into a
dumpable order.  (Only about 2700 of these are user objects that actually
end up in the output file; the rest are system objects like collations and
operators, which we include in the sorting pass in case there are
dependency paths leading through them.)  5976 of these are "pre-data
objects", so we end up with 5976 dependencies for the pre-data boundary
object; and then there are 386 data objects, which depend on the pre-data
boundary object and which the post-data boundary object depends on; and
then 1755 post-data objects, which all have dependencies on the post-data
boundary object.
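
To make the shape of the graph concrete, the boundary wiring looks like
this (where "A --> B" means "A depends on B", which is the direction
findLoop recurses along):

    1755 post-data objects --> post-data boundary --> 386 data objects
         --> pre-data boundary --> 5976 pre-data objects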

You can probably already see where this is going :-(.  For each post-data
object that's potentially involved in a dependency loop, the findLoop()
search in pg_dump_sort.c will need to recurse not only to the object's
ordinary dependencies (of which there are probably just a few), but to the
post-data boundary object.  From there it will need to recurse to each
data object.  And from each data object, it will recurse to the
pre-data boundary object.  And from there, it'll have to recurse to each
and every pre-data object.  Not to mention recursing to the actual
dependencies of each one.  Now the recursion is pretty quick, but still
this is a significant number of cycles.
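
Some very rough arithmetic shows the scale (counting only the fan-out
through the boundary objects, and ignoring the objects' ordinary
dependencies): each entry into the pre-data boundary costs a scan of all
~5976 pre-data objects, and within a single findLoop() search nothing
prevents re-entering that subtree once per data object, i.e. ~386 times,
so one search from one post-data start point is already on the order of

    386 * 5976 ~= 2.3 million node visits

and repeating that for each of up to ~1755 post-data start points pushes
the total toward

    1755 * 386 * 5976 ~= 4 billion visits.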

But wait, it gets worse.

In a data-only dump, we have even more dependencies, because we sort all
these objects (even though we won't dump most of them), and we also add
dependencies representing foreign-key constraints, in hopes of arriving at
a dump order wherein the foreign-key constraints won't be violated during
the restore.  Joe's example has about 320 FK constraints, some of them in
fairly long chains (ie, table A references table B which references table
C which references table D etc).  What this means is that there are a
multitude of dependency paths from the post-data boundary to the pre-data
boundary, for instance from boundary to table A to boundary, or from
boundary to table A to table B to boundary, or from boundary to table A to
table B to table C to boundary, etc etc etc.  After chasing down *each* of
these paths, findLoop has to visit every one of the pre-data objects.
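
To spell the multiplication out for one such chain: when the search
descends from the post-data boundary into table A's data object, it walks
the failing paths A, A-B, A-B-C, and A-B-C-D, and every table-data object
visited along the way also depends on the pre-data boundary, so each of
those visits triggers yet another full scan of the pre-data objects.  B,
C, and D are also entered directly from the post-data boundary, so even
this one four-table chain yields 4 + 3 + 2 + 1 = 10 pre-data scans within
a single findLoop() search; with ~320 FK constraints forming overlapping
chains, it snowballs from there.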

So what's surprising is not so much that the sort takes 30 seconds as that
it finishes before the heat death of the universe.

I've found a fairly simple fix for this, which depends on the observation
that once we've failed to find a dependency path from object X back to our
starting object, there's no need to search again if we arrive at X via a
different dependency path.  This is noninvasive and obviously correct, and
it seems to completely eliminate the performance issue in Joe's test case.
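
Just to make the idea concrete outside the context diff below, here is a
toy model of the same trick (this is not pg_dump code: the graph, node
numbering, and function name are invented for illustration, and the real
findLoop() also records the loop members in workspace[], which this sketch
omits).  The point to notice is that searchFailed[] stores the start
point's id rather than a plain bool, so it never has to be re-zeroed when
we move on to the next start point:

#include <stdbool.h>
#include <stdio.h>

#define NNODES 6

static bool adj[NNODES][NNODES];    /* adj[i][j]: node i depends on node j */
static int  searchFailed[NNODES];   /* start id for which node i was proven a dead end */
static bool onPath[NNODES];         /* nodes on the current recursion path */

/* Is there a dependency path from "node" back to "start"? */
static bool
pathBackToStart(int node, int start)
{
    if (searchFailed[node] == start)
        return false;               /* already proven futile for this start point */
    if (onPath[node])
        return false;               /* don't chase our own tail */

    onPath[node] = true;
    for (int next = 0; next < NNODES; next++)
    {
        if (adj[node][next] &&
            (next == start || pathBackToStart(next, start)))
        {
            onPath[node] = false;
            return true;
        }
    }
    onPath[node] = false;

    /* Remember the failure; stays valid until we pick another start point */
    searchFailed[node] = start;
    return false;
}

int
main(void)
{
    /* 0 -> 1 -> 2 -> 0 is a loop; 3, 4, 5 merely point into it */
    adj[0][1] = adj[1][2] = adj[2][0] = true;
    adj[3][0] = adj[4][3] = adj[5][4] = true;

    for (int i = 0; i < NNODES; i++)
        searchFailed[i] = -1;       /* -1: nothing recorded yet (pg_dump can use
                                     * pg_malloc0's zeroes, since real dump IDs
                                     * start at 1) */

    for (int start = 0; start < NNODES; start++)
        printf("node %d is in a loop: %s\n",
               start, pathBackToStart(start, start) ? "yes" : "no");
    return 0;
}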

There are other things that we might consider doing, like excluding system
objects from the dependency sort; but I'm less excited about that because
it's not entirely clear that there wouldn't be unexpected behavioral
consequences.  In any case, the more user objects there are in the database,
the less it would matter; we've definitely heard of databases with tens of
thousands of user tables, for instance.  Another idea would be to somehow be
smarter about quickly eliminating objects that aren't themselves members of
any dependency loop but merely depend on objects that are.
But that seems like a research project; I definitely don't see a cheap
way to get TopoSort to detect such cases.

So I propose applying the attached, and back-patching to 9.1, which is
as far back as the section-boundary objects are used.

            regards, tom lane

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 1b505a0..f0caa6b 100644
*** a/src/bin/pg_dump/pg_dump_sort.c
--- b/src/bin/pg_dump/pg_dump_sort.c
*************** static void findDependencyLoops(Dumpable
*** 137,142 ****
--- 137,143 ----
  static int findLoop(DumpableObject *obj,
           DumpId startPoint,
           bool *processed,
+          DumpId *searchFailed,
           DumpableObject **workspace,
           int depth);
  static void repairDependencyLoop(DumpableObject **loop,
*************** static void
*** 633,653 ****
  findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs)
  {
      /*
!      * We use two data structures here.  One is a bool array processed[],
!      * which is indexed by dump ID and marks the objects already processed
!      * during this invocation of findDependencyLoops().  The other is a
!      * workspace[] array of DumpableObject pointers, in which we try to build
!      * lists of objects constituting loops.  We make workspace[] large enough
!      * to hold all the objects, which is huge overkill in most cases but could
!      * theoretically be necessary if there is a single dependency chain
!      * linking all the objects.
       */
      bool       *processed;
      DumpableObject **workspace;
      bool        fixedloop;
      int            i;

      processed = (bool *) pg_malloc0((getMaxDumpId() + 1) * sizeof(bool));
      workspace = (DumpableObject **) pg_malloc(totObjs * sizeof(DumpableObject *));
      fixedloop = false;

--- 634,668 ----
  findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs)
  {
      /*
!      * We use three data structures here:
!      *
!      * processed[] is a bool array indexed by dump ID, marking the objects
!      * already processed during this invocation of findDependencyLoops().
!      *
!      * searchFailed[] is another array indexed by dump ID.  searchFailed[j] is
!      * set to dump ID k if we have proven that there is no dependency path
!      * leading from object j back to start point k.  This allows us to skip
!      * useless searching when there are multiple dependency paths from k to j,
!      * which is a common situation.  We could use a simple bool array for
!      * this, but then we'd need to re-zero it for each start point, resulting
!      * in O(N^2) zeroing work.  Using the start point's dump ID as the "true"
!      * value lets us skip clearing the array before we consider the next start
!      * point.
!      *
!      * workspace[] is an array of DumpableObject pointers, in which we try to
!      * build lists of objects constituting loops.  We make workspace[] large
!      * enough to hold all the objects in TopoSort's output, which is huge
!      * overkill in most cases but could theoretically be necessary if there is
!      * a single dependency chain linking all the objects.
       */
      bool       *processed;
+     DumpId       *searchFailed;
      DumpableObject **workspace;
      bool        fixedloop;
      int            i;

      processed = (bool *) pg_malloc0((getMaxDumpId() + 1) * sizeof(bool));
+     searchFailed = (DumpId *) pg_malloc0((getMaxDumpId() + 1) * sizeof(DumpId));
      workspace = (DumpableObject **) pg_malloc(totObjs * sizeof(DumpableObject *));
      fixedloop = false;

*************** findDependencyLoops(DumpableObject **obj
*** 657,663 ****
          int            looplen;
          int            j;

!         looplen = findLoop(obj, obj->dumpId, processed, workspace, 0);

          if (looplen > 0)
          {
--- 672,683 ----
          int            looplen;
          int            j;

!         looplen = findLoop(obj,
!                            obj->dumpId,
!                            processed,
!                            searchFailed,
!                            workspace,
!                            0);

          if (looplen > 0)
          {
*************** findDependencyLoops(DumpableObject **obj
*** 685,690 ****
--- 705,711 ----
          exit_horribly(modulename, "could not identify dependency loop\n");

      free(workspace);
+     free(searchFailed);
      free(processed);
  }

*************** findDependencyLoops(DumpableObject **obj
*** 695,700 ****
--- 716,722 ----
   *    obj: object we are examining now
   *    startPoint: dumpId of starting object for the hoped-for circular loop
   *    processed[]: flag array marking already-processed objects
+  *    searchFailed[]: flag array marking already-unsuccessfully-visited objects
   *    workspace[]: work array in which we are building list of loop members
   *    depth: number of valid entries in workspace[] at call
   *
*************** static int
*** 708,713 ****
--- 730,736 ----
  findLoop(DumpableObject *obj,
           DumpId startPoint,
           bool *processed,
+          DumpId *searchFailed,
           DumpableObject **workspace,
           int depth)
  {
*************** findLoop(DumpableObject *obj,
*** 721,726 ****
--- 744,756 ----
          return 0;

      /*
+      * If we've already proven there is no path from this object back to the
+      * startPoint, forget it.
+      */
+     if (searchFailed[obj->dumpId] == startPoint)
+         return 0;
+
+     /*
       * Reject if obj is already present in workspace.  This test prevents us
       * from going into infinite recursion if we are given a startPoint object
       * that links to a cycle it's not a member of, and it guarantees that we
*************** findLoop(DumpableObject *obj,
*** 759,770 ****
--- 789,806 ----
          newDepth = findLoop(nextobj,
                              startPoint,
                              processed,
+                             searchFailed,
                              workspace,
                              depth);
          if (newDepth > 0)
              return newDepth;
      }

+     /*
+      * Remember there is no path from here back to startPoint
+      */
+     searchFailed[obj->dumpId] = startPoint;
+
      return 0;
  }

