Re: directory tree query with big planner variation - Mailing list pgsql-performance

From: Mark Lewis
Subject: Re: directory tree query with big planner variation
Date:
Msg-id: 1154354001.1634.792.camel@archimedes
In response to: Re: directory tree query with big planner variation  (Michael Stone <mstone+postgres@mathom.us>)
Responses: Re: directory tree query with big planner variation  (Axel Rau <Axel.Rau@Chaos1.DE>)
List: pgsql-performance
It seems like you might be able to avoid the expensive directory lookups
entirely without changing the schema by defining an immutable function
dir_depth(path), which would just count the number of forward slashes.
Then create a functional index on dir_depth(path) and in the query do a
check for directories with a given prefix and the expected dir_depth.

In 8.1 and later, this kind of query might also be able to use a bitmap
index scan to AND together the prefix condition and the dir_depth
condition from two separate indexes.
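
For concreteness, a rough sketch of the idea (untested; the function and
index names here are just placeholders):

  CREATE FUNCTION dir_depth(text) RETURNS integer AS $$
      -- depth = number of forward slashes in the stored path
      SELECT length($1) - length(replace($1, '/', ''));
  $$ LANGUAGE sql IMMUTABLE;

  CREATE INDEX path_depth_idx ON bacula.path (dir_depth(path));

  -- "immediate subdirectories of /Users/axel/" then becomes something like:
  SELECT P.path
    FROM bacula.path P
   WHERE P.path LIKE '/Users/axel/%'
     AND dir_depth(P.path) = dir_depth('/Users/axel/') + 1;

With the existing path index plus the dir_depth index in place, the planner
could potentially combine the two conditions with a bitmap AND.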

This is just a hunch and I haven't looked too carefully at the
schema/query requirements to see if it's feasible, but it seems like a
reasonable approach.

-- Mark Lewis


On Mon, 2006-07-31 at 09:30 -0400, Michael Stone wrote:
> On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
> >On 31.07.2006 at 13:15, Michael Stone wrote:
> >>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
> >>>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
> >>
> >>This can't be indexed. You might try something like WHERE P.path
> >>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'
>
> Ignore that, I wasn't awake yet.
>
> >Why does it do quite well in this case:
> >---------------------------------------
> >->  Index Scan using path_name_idx on path p  (cost=0.00..3.02 rows=1 width=97) (actual time=15.480..56.935 rows=27 loops=1)
> >       Index Cond: ((path >= '/Users/axel/Library/Preferences/'::text) AND (path < '/Users/axel/Library/Preferences0'::text))
> >       Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/Preferences/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >as compared to this case(ignoring the index on path):
> >---------------------------------------
> >->  Index Scan using path_pkey on path p  (cost=0.00..2567.57 rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
> >       Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >? With all longer path names, I get the above (good) result.
> >Should I put the rtrim/replace on the client side?
>
> That's not the slow part. The slow part is retrieving every single file
> for each of the subdirectories in order to determine whether there are
> any files in the subdirectories.
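>
> In principle all that's needed per subdirectory is an existence-style
> probe rather than pulling every row; schematically (hypothetical and
> untested, not the actual bacula query):
>
>   SELECT P.path,
>          EXISTS (SELECT 1 FROM bacula.file F
>                   WHERE F.pathid = P.pathid) AS has_children
>     FROM bacula.path P
>    WHERE P.path ~ '^/Users/axel/[^/]*/$';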
>
> >>The schema could be a lot more intelligent here. (E.g., store path
> >>separately from file/directory name, store type (file or directory)
> >>separately, etc.) Without improving the schema I don't think this
> >>will ever be a speed demon.
>
> >PATH holds complete pathnames of directories, FILENAME holds
> >filenames and pathname components.
> >Currently the schema is the lowest common denominator between SQLite,
> >MySQL and pg and the bacula people will stay with that (-;).
>
> Nothing I suggested raises the bar for the "lowest common denominator".
> If I understand the intent of this SQL, you're pulling all the entries
> in a directory in two parts. The first part (files) is fairly
> straightforward. The second part (directories) consists of pulling any
> file whose parent is a subdirectory of the directory you're looking for
> (this is *all* children of the directory, since you have to retrieve
> every element that begins with the directory, then discard those that
> have an additional / in their name), counting how many of these there
> are for each subdirectory, and discarding those results except for a
> binary (yes there are children or no there aren't). This is a lot of
> useless work to go through, and is going to be slow if you've got a lot
> of stuff in a subdirectory. An alternative approach would be, for each
> directory, to store all its children (files and subdirectories) along
> with a flag indicating which it is. This would allow you to create the
> collapsed tree view without walking all the children of a subdirectory.
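>
> Just to illustrate the shape of that alternative (hypothetical, not the
> existing bacula schema):
>
>   CREATE TABLE dir_entry (
>       parent_pathid integer NOT NULL,  -- directory this entry lives in
>       name          text    NOT NULL,  -- file or subdirectory name
>       is_dir        boolean NOT NULL   -- the "which is it" flag
>   );
>   CREATE INDEX dir_entry_parent_idx ON dir_entry (parent_pathid);
>
> One level of the tree view is then a single index scan on parent_pathid,
> without touching anything below the immediate children.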
>
> Assuming you can't make changes to the schema, what about the query?
> You've got this:
>
> explain analyze  SELECT X.name AS name, COUNT(CH) > 1 AS children
>   FROM
>     ( SELECT  RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
>                                                  FN.name AS CH
>         FROM
>           ( SELECT P.path,P.pathid
>               FROM bacula.path P
>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>           LEFT OUTER JOIN bacula.file F
>             ON
>               NLPC.pathid = F.pathid
>           LEFT OUTER JOIN bacula.filename FN
>             ON
>               F.filenameid = FN.filenameid
>         GROUP BY NLPC.path, FN.name
>       UNION
>       SELECT  FN.name AS name, FN.name AS CH
>         FROM
>           bacula.path P, bacula.file F, bacula.filename FN
>         WHERE
>           P.path = '%@/'   AND
>           P.pathid = F.pathid                           AND
>           F.filenameid = FN.filenameid
>     ) AS X
>   WHERE X.name <> ''
>   GROUP BY X.name
>
> I'm only looking at the first part, which reduces to
> SELECT X.name AS name, COUNT(CH) > 1 AS children
>   FROM ( SELECT NLPC.path AS name, FN.name AS CH
>            FROM ( SELECT P.path,P.pathid FROM bacula.path P
>                   WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>            LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
>            LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
>            GROUP BY NLPC.path,FN.name ) AS X
>   GROUP BY X.name
>
> Why is the filename table even accessed? Would the results be the
> same if you did
>   SELECT NLPC.path AS name, F.fileid AS CH
> and dropped the LEFT OUTER JOIN bacula.filename altogether?
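>
> Something along these lines (untested):
>
>   SELECT NLPC.path AS name, F.fileid AS CH
>     FROM ( SELECT P.path, P.pathid FROM bacula.path P
>            WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>     LEFT OUTER JOIN bacula.file F ON NLPC.pathid = F.pathid
>     GROUP BY NLPC.path, F.fileid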
>
> And then what happens if you try something like
> SELECT X.name,X.children
>  FROM
>        (SELECT [rtrim]P.path AS name,
>                (SELECT count(*)
>                   FROM (SELECT 1 FROM bacula.file F
>                          WHERE F.pathid = P.pathid
>                          LIMIT 2) AS probe) > 1 AS children
>           FROM bacula.path P
>           WHERE P.path ~ '^%@/[^/]*/$'
>         UNION
>         SELECT FN.name, false
>           FROM bacula.path P, bacula.file F, bacula.filename FN
>           WHERE
>             P.path = '%@/'   AND
>             P.pathid = F.pathid                           AND
>             F.filenameid = FN.filenameid
>         ) AS X
>  WHERE X.name <> ''
>  GROUP BY X.name, X.children
>
> It's hard to say without knowing what's actually *in* the tables, but
> the existing query definitely doesn't scale well for what I think it's
> trying to do.
>
> Mike Stone
>
