Re: directory tree query with big planner variation - Mailing list pgsql-performance
From | Michael Stone |
---|---|
Subject | Re: directory tree query with big planner variation |
Date | |
Msg-id | 20060731133034.GF2900@mathom.us Whole thread Raw |
In response to | Re: directory tree query with big planner variation (Axel Rau <Axel.Rau@Chaos1.DE>) |
Responses |
Re: directory tree query with big planner variation
Re: directory tree query with big planner variation |
List | pgsql-performance |
On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote: >Am 31.07.2006 um 13:15 schrieb Michael Stone: >>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote: >>> WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC >> >>This can't be indexed. You might try something like WHERE P.path >>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$' Ignore that, I wasn't awake yet. >Why does it quite well in this case: >--------------------------------------- >-> Index Scan using path_name_idx on path p (cost=0.00..3.02 rows=1 >width=97) (actual time=15.480..56.935 rows=27 loops=1) > Index Cond: ((path >= '/Users/axel/Library/ >Preferences/'::text) AND (path < '/Users/axel/Library/ >Preferences0'::text)) > Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/ >$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/ >Preferences/'::text, ''::text), '/'::text) <> ''::text)) >--------------------------------------- >as compared to this case(ignoring the index on path): >--------------------------------------- >-> Index Scan using path_pkey on path p (cost=0.00..2567.57 >rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1) > Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim >("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <> >''::text)) >--------------------------------------- >? With all longer path names, I get the above (good)result. >Should I put the rtrim/replace on the client side? That's not the slow part. The slow part is retrieving every single file for each of the subdirectories in order to determine whether there are any files in the subdirectories. >>The schema could be a lot more intelligent here. (E.g., store path >>seperately from file/directory name, store type (file or directory) >>seperately, etc.) Without improving the schema I don't think this >>will ever be a speed demon. >PATH holds complete pathnames of directories, FILENAME holds >filenames and pathname components. >Currently the schema is the lowest common denominator between SQLite, >MySQL and pg and the bacula people will stay with that (-;). Nothing I suggested raises the bar for the "lowest common denominator". If I understand the intend of this SQL, you're pulling all the entries in a directory in two parts. The first part (files) is fairly straightforward. The second part (directories) consists of pulling any file whose parent is a subdirectory of the directory you're looking for (this is *all* children of the directory, since you have to retrieve every element that begins with the directory, then discard those that have an additional / in their name), counting how many of these there are for each subdirectory, and discarding those results except for a binary (yes there are children or no there aren't). This is a lot of useless work to go through, and is going to be slow if you've got a lot of stuff in a subdirectory. An alternative approach would be, for each directory, to store all its children (files and subdirectories) along with a flag indicating which it is. This would allow you to create the collapsed tree view without walking all the children of a subdirectory. Assuming you can't make changes to the schema, what about the query? You've got this: explain analyze SELECT X.name AS name, COUNT(CH) > 1 AS children FROM ( SELECT RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name, FN.name AS CH FROM ( SELECT P.path,P.pathid FROM bacula.path P WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC LEFT OUTER JOIN bacula.file F ON NLPC.pathid = F.pathid LEFT OUTER JOIN bacula.filename FN ON F.filenameid = FN.filenameid GROUP BY NLPC.path, FN.name UNION SELECT FN.name AS name, FN.name AS CH FROM bacula.path P, bacula.file F, bacula.filename FN WHERE P.path = '%@/' AND P.pathid = F.pathid AND F.filenameid = FN.filenameid ) AS X WHERE X.name <> '' GROUP BY X.name I'm only looking at the first part, which reduces to SELECT X.name AS name, COUNT(CH) > 1 AS children FROM SELECT NLPC.path AS name, FN.name as CH FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid GROUP BY NLPC.path,FN.name Why is the filename table even accessed? Would the results be the same if you did SELECT NLPC.path AS name, F.fileid AS CH and drop the LEFT OUTER JOIN bacula.filename altogether? And then what happens if you try something like SELECT X.name,X.children FROM (SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F WHERE F.pathid = P.pathid LIMIT 2) > 1 FROM bacula.path P WHERE P.path ~ '^%@/[^/]*/$' UNION SELECT FN.name,0 FROM bacula.path P, bacula.file F, bacula.filename FN WHERE P.path = '%@/' AND P.pathid = F.pathid AND F.filenameid = FN.filenameid ) AS X WHERE X.name <> '' GROUP BY X.name It's hard to say without knowing what's actually *in* the tables, but the existing query definately doesn't scale well for what I think it's trying to do. Mike Stone
pgsql-performance by date: