Re: directory tree query with big planner variation - Mailing list pgsql-performance
From: Mark Lewis
Subject: Re: directory tree query with big planner variation
Msg-id: 1154354001.1634.792.camel@archimedes
In response to: Re: directory tree query with big planner variation (Michael Stone <mstone+postgres@mathom.us>)
Responses: Re: directory tree query with big planner variation
List: pgsql-performance
It seems like you might be able to avoid the expensive directory lookups entirely, without changing the schema, by defining an immutable function dir_depth(path) that just counts the number of forward slashes. Then create a functional index on dir_depth(path), and in the query check for directories with a given prefix and the expected dir_depth. In 8.1 and later, this kind of query might be able to use a bitmap index scan to combine the two indexes. This is just a hunch and I haven't looked too carefully at the schema/query requirements to see if it's feasible, but it seems like a reasonable approach.

-- Mark Lewis

On Mon, 2006-07-31 at 09:30 -0400, Michael Stone wrote:
> On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
> >Am 31.07.2006 um 13:15 schrieb Michael Stone:
> >>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
> >>> WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
> >>
> >>This can't be indexed. You might try something like WHERE P.path
> >>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'
>
> Ignore that, I wasn't awake yet.
>
> >Why does it do quite well in this case:
> >---------------------------------------
> >-> Index Scan using path_name_idx on path p (cost=0.00..3.02 rows=1
> >width=97) (actual time=15.480..56.935 rows=27 loops=1)
> >      Index Cond: ((path >= '/Users/axel/Library/
> >Preferences/'::text) AND (path < '/Users/axel/Library/
> >Preferences0'::text))
> >      Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/
> >$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/
> >Preferences/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >as compared to this case (ignoring the index on path):
> >---------------------------------------
> >-> Index Scan using path_pkey on path p (cost=0.00..2567.57
> >rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
> >      Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim
> >("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <>
> >''::text))
> >---------------------------------------
> >? With all longer path names, I get the above (good) result.
> >Should I put the rtrim/replace on the client side?
>
> That's not the slow part. The slow part is retrieving every single file
> for each of the subdirectories in order to determine whether there are
> any files in the subdirectories.
>
> >>The schema could be a lot more intelligent here. (E.g., store path
> >>separately from file/directory name, store type (file or directory)
> >>separately, etc.) Without improving the schema I don't think this
> >>will ever be a speed demon.
>
> >PATH holds complete pathnames of directories, FILENAME holds
> >filenames and pathname components.
> >Currently the schema is the lowest common denominator between SQLite,
> >MySQL and pg and the bacula people will stay with that (-;).
>
> Nothing I suggested raises the bar for the "lowest common denominator".
> If I understand the intent of this SQL, you're pulling all the entries
> in a directory in two parts.
> The first part (files) is fairly
> straightforward. The second part (directories) consists of pulling any
> file whose parent is a subdirectory of the directory you're looking for
> (this is *all* children of the directory, since you have to retrieve
> every element that begins with the directory, then discard those that
> have an additional / in their name), counting how many of these there
> are for each subdirectory, and discarding those results except for a
> binary (yes there are children or no there aren't). This is a lot of
> useless work to go through, and is going to be slow if you've got a lot
> of stuff in a subdirectory. An alternative approach would be, for each
> directory, to store all its children (files and subdirectories) along
> with a flag indicating which it is. This would allow you to create the
> collapsed tree view without walking all the children of a subdirectory.
>
> Assuming you can't make changes to the schema, what about the query?
> You've got this:
>
> explain analyze SELECT X.name AS name, COUNT(CH) > 1 AS children
> FROM
>  ( SELECT RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
>           FN.name AS CH
>    FROM
>     ( SELECT P.path, P.pathid
>       FROM bacula.path P
>       WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>    LEFT OUTER JOIN bacula.file F
>      ON NLPC.pathid = F.pathid
>    LEFT OUTER JOIN bacula.filename FN
>      ON F.filenameid = FN.filenameid
>    GROUP BY NLPC.path, FN.name
>    UNION
>    SELECT FN.name AS name, FN.name AS CH
>    FROM bacula.path P, bacula.file F, bacula.filename FN
>    WHERE
>      P.path = '%@/' AND
>      P.pathid = F.pathid AND
>      F.filenameid = FN.filenameid
>  ) AS X
> WHERE X.name <> ''
> GROUP BY X.name
>
> I'm only looking at the first part, which reduces to
> SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
>  SELECT NLPC.path AS name, FN.name as CH
>  FROM ( SELECT P.path, P.pathid FROM bacula.path AS NLPC
>  LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
>  LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
>  GROUP BY
>  NLPC.path, FN.name
>
> Why is the filename table even accessed? Would the results be the
> same if you did
>  SELECT NLPC.path AS name, F.fileid AS CH
> and drop the LEFT OUTER JOIN bacula.filename altogether?
>
> And then what happens if you try something like
> SELECT X.name, X.children
> FROM
>  (SELECT [rtrim]P.path, (SELECT count(*) FROM bacula.file F
>                          WHERE F.pathid = P.pathid
>                          LIMIT 2) > 1
>   FROM bacula.path P
>   WHERE P.path ~ '^%@/[^/]*/$'
>   UNION
>   SELECT FN.name, 0
>   FROM bacula.path P, bacula.file F, bacula.filename FN
>   WHERE
>     P.path = '%@/' AND
>     P.pathid = F.pathid AND
>     F.filenameid = FN.filenameid
>  ) AS X
> WHERE X.name <> ''
> GROUP BY X.name
>
> It's hard to say without knowing what's actually *in* the tables, but
> the existing query definitely doesn't scale well for what I think it's
> trying to do.
>
> Mike Stone
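Mark's dir_depth suggestion at the top of the message could be fleshed out roughly like this. The bacula.path table and its path column come from the thread; the function name, index name, and example path are illustrative, and the prefix scan on path assumes a btree index that supports it (C locale, or one built with text_pattern_ops):

```sql
-- Count the forward slashes in a path; declaring the function IMMUTABLE
-- is what makes it usable in an expression (functional) index.
CREATE FUNCTION dir_depth(text) RETURNS integer AS $$
    SELECT length($1) - length(replace($1, '/', ''));
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX path_depth_idx ON bacula.path (dir_depth(path));

-- Immediate subdirectories of '/Users/axel/' are exactly the rows with
-- that prefix and one more slash than the parent.  With both this index
-- and a btree on path available, 8.1+ can AND the two index scans
-- together with a BitmapAnd before touching the heap.
SELECT path
FROM bacula.path
WHERE path LIKE '/Users/axel/%'
  AND dir_depth(path) = dir_depth('/Users/axel/') + 1;
```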
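One detail in the LIMIT 2 trick quoted above: a LIMIT attached directly to the count(*) query limits its (single-row) result, not the rows being counted, so it wouldn't actually let the scan stop early. For the early exit to work, the LIMIT has to sit in an inner subquery. A sketch using the thread's table names, with a literal path standing in for the '%@/' placeholder:

```sql
-- "Does this directory contain more than one entry?" without counting
-- every file row: the inner LIMIT 2 lets the scan stop after two matches.
SELECT rtrim(replace(P.path, '/Users/axel/', ''), '/') AS name,
       (SELECT count(*)
        FROM (SELECT 1
              FROM bacula.file F
              WHERE F.pathid = P.pathid
              LIMIT 2) AS two) > 1 AS children
FROM bacula.path P
WHERE P.path ~ '^/Users/axel/[^/]*/$';
```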