Home > mailing lists

Re: directory tree query with big planner variation - Mailing list pgsql-performance

From	Michael Stone
Subject	Re: directory tree query with big planner variation
Date	July 31, 2006 10:32:08
Msg-id	20060731133034.GF2900@mathom.us Whole thread Raw
In response to	Re: directory tree query with big planner variation (Axel Rau <Axel.Rau@Chaos1.DE>)
Responses	Re: directory tree query with big planner variation Re: directory tree query with big planner variation
List	pgsql-performance

Tree view

On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
>Am 31.07.2006 um 13:15 schrieb Michael Stone:
>>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
>>>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>>
>>This can't be indexed. You might try something like WHERE P.path
>>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'

Ignore that, I wasn't awake yet.

>Why does it quite well in this case:
>---------------------------------------
>->  Index Scan using path_name_idx on path p  (cost=0.00..3.02 rows=1
>width=97) (actual time=15.480..56.935 rows=27 loops=1)
>       Index Cond: ((path >= '/Users/axel/Library/
>Preferences/'::text) AND (path < '/Users/axel/Library/
>Preferences0'::text))
>       Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/
>$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/
>Preferences/'::text, ''::text), '/'::text) <> ''::text))
>---------------------------------------
>as compared to this case(ignoring the index on path):
>---------------------------------------
>->  Index Scan using path_pkey on path p  (cost=0.00..2567.57
>rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
>       Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim
>("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <>
>''::text))
>---------------------------------------
>? With all longer path names, I get the above (good)result.
>Should I put the rtrim/replace on the client side?

That's not the slow part. The slow part is retrieving every single file
for each of the subdirectories in order to determine whether there are
any files in the subdirectories.

>>The schema could be a lot more intelligent here. (E.g., store path
>>seperately from file/directory name, store type (file or directory)
>>seperately, etc.) Without improving the schema I don't think this
>>will ever be a speed demon.

>PATH holds complete pathnames of directories, FILENAME holds
>filenames and pathname components.
>Currently the schema is the lowest common denominator between SQLite,
>MySQL and pg and the bacula people will stay with that (-;).

Nothing I suggested raises the bar for the "lowest common denominator".
If I understand the intend of this SQL, you're pulling all the entries
in a directory in two parts. The first part (files) is fairly
straightforward. The second part (directories) consists of pulling any
file whose parent is a subdirectory of the directory you're looking for
(this is *all* children of the directory, since you have to retrieve
every element that begins with the directory, then discard those that
have an additional / in their name), counting how many of these there
are for each subdirectory, and discarding those results except for a
binary (yes there are children or no there aren't). This is a lot of
useless work to go through, and is going to be slow if you've got a lot
of stuff in a subdirectory. An alternative approach would be, for each
directory, to store all its children (files and subdirectories) along
with a flag indicating which it is. This would allow you to create the
collapsed tree view without walking all the children of a subdirectory.

Assuming you can't make changes to the schema, what about the query?
You've got this:

explain analyze  SELECT X.name AS name, COUNT(CH) > 1 AS children
  FROM
    ( SELECT  RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
                                                 FN.name AS CH
        FROM
          ( SELECT P.path,P.pathid
              FROM bacula.path P
              WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
          LEFT OUTER JOIN bacula.file F
            ON
              NLPC.pathid = F.pathid
          LEFT OUTER JOIN bacula.filename FN
            ON
              F.filenameid = FN.filenameid
        GROUP BY NLPC.path, FN.name
      UNION
      SELECT  FN.name AS name, FN.name AS CH
        FROM
          bacula.path P, bacula.file F, bacula.filename FN
        WHERE
          P.path = '%@/'   AND
          P.pathid = F.pathid                           AND
          F.filenameid = FN.filenameid
    ) AS X
  WHERE X.name <> ''
  GROUP BY X.name

I'm only looking at the first part, which reduces to
SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
  SELECT NLPC.path AS name, FN.name as CH
    FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC
           LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
           LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
           GROUP BY NLPC.path,FN.name

Why is the filename table even accessed? Would the results be the
same if you did
  SELECT NLPC.path AS name, F.fileid AS CH
and drop the LEFT OUTER JOIN bacula.filename altogether?

And then what happens if you try something like
SELECT X.name,X.children
 FROM
       (SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F
                              WHERE F.pathid = P.pathid
                              LIMIT 2) > 1
          FROM bacula.path P
          WHERE P.path ~ '^%@/[^/]*/$'
        UNION
        SELECT FN.name,0
          FROM bacula.path P, bacula.file F, bacula.filename FN
          WHERE
            P.path = '%@/'   AND
            P.pathid = F.pathid                           AND
            F.filenameid = FN.filenameid
        ) AS X
 WHERE X.name <> ''
 GROUP BY X.name

It's hard to say without knowing what's actually *in* the tables, but
the existing query definately doesn't scale well for what I think it's
trying to do.

Mike Stone

pgsql-performance by date:

From: Tom Lane
Date: 31 July 2006, 09:54:50
Subject: Re: sub select performance due to seq scans

From: Mark Lewis
Date: 31 July 2006, 10:54:06
Subject: Re: directory tree query with big planner variation

Re: directory tree query with big planner variation - Mailing list pgsql-performance

Previous

Next