Thread: Need suggestion high-level suggestion on how to solve a performance problem
From: Madison Kelly
Date:
Hi all, I hope I am not asking too many questions. :) I have been trying to solve a performance problem in my program for a while now and, after getting an index to work which didn't speed things up enough, I am stumped. I am hoping someone here might have come across a similar issue and come up with a creative solution they wouldn't mind sharing. I am not looking for details, I expect to do my homework; I just need a pointer, suggestion or trick.

The problem I have is that I am using pgSQL as a back end for my web-based *nix backup program. Part of the database stores info on every file and directory per partition, and I use this information to build my directory tree. I have a test partition with ~325,000 files, of which ~30,000 are directories. I have been able to get the performance up to a reasonable level for displaying the directory tree, including expanding and contracting branches (~3-5 sec). I do this by loading all the directory info into an array and a hash once and using them as needed instead of hitting the DB.

The problem comes when the user toggles a directory branch's backup flag (a simple check box beside the directory name). If it's a directory near the end of a branch it is fast enough, and if they toggle a single file it is nearly instant. However, if they toggle, say, the root directory, so that every file and directory below it needs to be updated, it can take 500-600 sec to return. Obviously this is no good. What I need is a scheme for being able to say, essentially:

UPDATE file_info_1 SET file_backup='t' WHERE file_parent_dir~'^/';

faster. An index isn't enough, because the query needs to hit every entry anyway.

I use Perl to access the DB and generate the web pages. The file browser portion looks and acts like most file browsers (a directory tree in the left frame with expanding and contracting directory branches, and a list of files in a given directory on the right).
It does not use any plug-ins like Java, and it is important to me that it stays that way (I want it to be as simple as possible for the user to install).

So far the only suggestion I've received is to keep a secondary 'delta' table to store just the requests, then on load fetch the existing data and check it against the delta table before creating the page. The biggest drawback for me with this is that currently I don't need to provide an 'Apply' button, because a simple JavaScript call passes the request on to the Perl script immediately. I really like the Mac-esque approach of keeping the UI as simple and straightforward as possible. So a suggestion that doesn't require something like an 'Apply' button would be much appreciated.

Thanks for any suggestions in advance!

Madison

PS - For what it's worth, this is the last hurdle for me to overcome before I can finally release my program as 'beta' after over 15 months of work! :)
Hello,

I once upon a time worked at a company doing backup software, and I remember these problems; we had exactly the same! The file tree was all in memory, and every time the user clicked on something it had to update everything. Being C++ it was very fast, but to back up a million files you needed a gig of RAM, which is a problem, let's say, when you consider my Linux laptop has about 400k files on it. So we rewrote the project entirely with the goal of doing the million-files thing on a clunky Pentium 90 with 64 megabytes of RAM, and it worked. What I did was this:

- Use Berkeley DB. Berkeley DB isn't a database like Postgres, it's just a tree, but it's good for managing trees. It's quite fast, uses key compression, etc. It has, however, a few drawbacks:
  - Files tend to fragment a lot over time, and it can't reindex or vacuum like Postgres; you have to dump and reload.
  - The price of the licence to embed it in your product and sell it is expensive, and if you want it crash-proof, it's insanely expensive.
  - Even though it's a tree, it has no idea what a parent is, so you have to manage that yourself. We used a clever path encoding to keep all the paths inside the same directory close together in the tree, and separate databases for dirs and files, because we wanted the dirs to stay in the cache whereas we almost never touched the files.

And you can't make it work if you update every node every time the user clicks on something. You have to update one node. In your tree you have nodes; give each node a state, one of these three: include, exclude, inherit. When you fetch a node you also fetch all of its parents, and you propagate the state down to know the state of the final node. If a node is in state 'inherit' it behaves like its parent, and so on. So you have faster updates but slower selects.
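The three-state scheme above can be sketched in a few lines. This is an illustrative sketch, not code from either program: the state table, function names, and default behaviour (no explicit flag anywhere means "don't back up") are all assumptions.

```python
# Sketch of the include/exclude/inherit scheme: toggling a directory
# touches exactly one entry, and the effective state of any path is
# resolved by walking up its ancestors until one has an explicit state.

INHERIT, INCLUDE, EXCLUDE = "inherit", "include", "exclude"

# Explicit state per path; any path absent from this dict is INHERIT.
states = {"/": INCLUDE}

def set_state(path, state):
    """O(1) update: only the clicked node changes, however deep the tree."""
    states[path] = state

def ancestors(path):
    """Yield the path itself, then each parent directory up to '/'."""
    while True:
        yield path
        if path == "/":
            return
        path = path.rsplit("/", 1)[0] or "/"

def effective_state(path):
    """Walk upward until a non-inherit node decides the answer."""
    for p in ancestors(path):
        s = states.get(p, INHERIT)
        if s != INHERIT:
            return s
    return EXCLUDE  # nothing explicit anywhere: default to not backed up

set_state("/var/www", EXCLUDE)
print(effective_state("/var/www/mysite/index.html"))  # excluded via /var/www
print(effective_state("/home/alice/notes.txt"))       # included via /
```

The trade PFC describes is visible here: `set_state` is constant-time no matter how many files sit below the directory, while `effective_state` pays a walk up the tree on every read.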
However, there is a bonus: if you mark a directory as "include" and one of its subdirectories as "exclude", and the user adds files all over the place, the files added under the "included" directory will be automatically backed up and the ones under the "excluded" directory will be automatically ignored; you have nothing to change.

And it is not that slow. If you think about it: suppose you have /var/www/mysite/blah with 20,000 files in it. To inherit the state of the parents onto them you only have to fetch /var once, www once, etc. So if you propagate the inherited properties while doing a tree traversal, it comes at no extra cost. IMHO it's the only solution. It can be done quite easily, too, using the ltree type and a few stored procedures; you can even make a view which gives the state of each element, computed by inheritance.

Here's the secret: the user will select 100,000 files by clicking on a directory near the root, but the user will NEVER look at 100,000 files. So you can afford to make looking at files 10x slower if it makes including/excluding directories 100,000 times faster.

Now you'll ask me: how do I calculate the total size of the backup without looking at all the files? When I click on a directory I don't know which files are in it, which will inherit, and which will not. It's simple: you precompute it when you scan the disk for changed files. This is the only time you should do a complete tree exploration. On each directory we put a matrix [M]x[N], M and N each being one of the three states above, containing the amount of stuff in the directory which would be in state M if the directory were in state N. This is very easy to compute when you scan for new files. Then, when a directory changes state, you sum a few cells of that matrix to know how much that adds to the backup, and you only look up one record.

Is that helpful?
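The precomputed-matrix idea can be illustrated with a simplified sketch. Here the full [M]x[N] matrix is collapsed to the two cells the UI actually needs per directory: total bytes that would be backed up below it if the directory resolved to "include" versus "exclude". The tree layout, names, and byte figures are all hypothetical.

```python
# Bottom-up precompute, done once during the disk scan. Children with
# an explicit include/exclude flag contribute the same amount whichever
# way the parent resolves; "inherit" children pass the question down.

def precompute(tree, states):
    """tree: {dir: (bytes_of_own_files, [child_dirs])}
    states: {dir: "include"|"exclude"}, anything absent inherits.
    Returns {dir: {"include": bytes, "exclude": bytes}}."""
    memo = {}
    def walk(d):
        file_bytes, children = tree[d]
        inc, exc = file_bytes, 0  # this dir's own files follow its resolved state
        for c in children:
            walk(c)
            s = states.get(c, "inherit")
            if s == "inherit":   # child follows whatever the parent resolves to
                inc += memo[c]["include"]
                exc += memo[c]["exclude"]
            else:                # explicitly flagged child: same either way
                inc += memo[c][s]
                exc += memo[c][s]
        memo[d] = {"include": inc, "exclude": exc}
    walk("/")
    return memo

tree = {
    "/":        (0,   ["/etc", "/var"]),
    "/etc":     (100, []),
    "/var":     (0,   ["/var/log"]),
    "/var/log": (500, []),
}
states = {"/var/log": "exclude"}  # user explicitly excluded /var/log

memo = precompute(tree, states)
# One lookup answers "how big is the backup if / is included?"
print(memo["/"]["include"])  # 100 (/etc counts; /var/log is excluded anyway)
```

This is the point of PFC's suggestion: the full tree walk happens once at scan time, and every later toggle costs a single record lookup instead of touching every row below the directory.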
PFC wrote:
> Hello,
> I once upon a time worked at a company doing backup software, and I
> remember these problems; we had exactly the same!

Pretty neat. :)

> The file tree was all in memory, and every time the user clicked on
> something it had to update everything. Being C++ it was very fast,
> but to back up a million files you needed a gig of RAM, which is a
> problem, let's say, when you consider my Linux laptop has about 400k
> files on it.

I want this to run on "average" systems (I'm developing it primarily on my modest P3 1GHz Thinkpad w/ 512MB RAM running Debian), so expecting that much free memory is not reasonable. As it is, my test DB, with a realistic amount of data, is ~150MB.

> So we rewrote the project entirely with the goal of doing the
> million-files thing on a clunky Pentium 90 with 64 megabytes of RAM,
> and it worked. What I did was this:
> - Use Berkeley DB.
<snip>
>   - The price of the licence to embed it in your product and sell it
>     is expensive, and if you want it crash-proof, it's insanely
>     expensive.

This is the kicker right there; my program is released under the GPL, so it's fee-free. I can't eat anything costly like that. As it is, there are hundreds and hundreds of hours in this program that I am already hoping to recoup one day through support contracts. Adding commercial software, I am afraid, is not an option.

> bonus: if you mark a directory as "include" and one of its
> subdirectories as "exclude", and the user adds files all over the
> place, the files added under the "included" directory will be
> automatically backed up and the ones under the "excluded" directory
> will be automatically ignored; you have nothing to change.
<snip>
> IMHO it's the only solution.

Now *this* is an idea worth looking into. How I will implement it with my system I don't know yet, but it's a new line of thinking. Wonderful!

> Now you'll ask me: how do I calculate the total size of the backup
> without looking at all the files? When I click on a directory I don't
> know which files are in it, which will inherit, and which will not.
>
> It's simple: you precompute it when you scan the disk for changed
> files. This is the only time you should do a complete tree
> exploration.

This is already what I do. When a user selects a partition they want to back up or restore files from, the partition is scanned. The scan looks at every file, directory and symlink and records its size (on disk), its mtime, owner, group, etc. to the database. I've got this scan/update running at ~1,500 files/second on my laptop. That was actually the first performance tuning I started with. :) With all the data in the DB, the backup script can calculate rather intelligently where it wants to copy each directory to.

> On each directory we put a matrix [M]x[N], M and N each being one of
> the three states above, containing the amount of stuff in the
> directory which would be in state M if the directory were in state N.
> This is very easy to compute when you scan for new files. Then, when
> a directory changes state, you sum a few cells of that matrix to know
> how much that adds to the backup, and you only look up one record.

In my case, what I do is calculate the size of all the files selected for backup in each directory, sort the directories from all sources by the total size of their selected files, and then start assigning the directories, largest to smallest, to each of my available destination media. If it runs out of destination space, it backs up what it can, waits a user-definable amount of time, and then checks whether any new destination media has been made available. If so, it again tries to assign the files/directories that didn't fit. It will loop a user-definable number of times before giving up and warning the user that more destination space is needed for that backup job.

> Is that helpful?

The three states (inherit, backup, ignore) have definitely caught my attention. Thank you very much for your idea and lengthy reply!

Madison
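The assignment scheme Madison describes (sort directories largest to smallest, fill each destination medium, retry whatever doesn't fit when new media appears) is essentially first-fit decreasing bin packing. A minimal sketch, with all names and sizes hypothetical:

```python
# First-fit decreasing sketch of the destination-media assignment:
# sort directories by selected size, place each on the first medium
# with enough room, and collect what didn't fit so it can be retried
# later when new media is made available.

def assign(dir_sizes, media_capacities):
    """dir_sizes: {dir: bytes selected for backup}
    media_capacities: [free bytes per destination medium]
    Returns (placement {dir: media_index}, unplaced [dir, ...])."""
    free = list(media_capacities)
    placement, unplaced = {}, []
    for d, size in sorted(dir_sizes.items(), key=lambda kv: -kv[1]):
        for i, room in enumerate(free):
            if size <= room:
                placement[d] = i
                free[i] -= size
                break
        else:
            unplaced.append(d)  # back up the rest; retry these when new media arrives
    return placement, unplaced

placement, unplaced = assign(
    {"/var/www": 700, "/home": 400, "/etc": 50}, [800, 500])
print(placement)  # {'/var/www': 0, '/home': 1, '/etc': 0}
print(unplaced)   # []
```

The largest-first ordering matters: placing big directories first leaves the small ones to fill the remaining gaps, which is exactly why Madison sorts by total selected size before assigning.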
> This is the kicker right there; my program is released under the GPL,
> so it's fee-free. I can't eat anything costly like that. As it is,
> there are hundreds and hundreds of hours in this program that I am
> already hoping to recoup one day through support contracts. Adding
> commercial software, I am afraid, is not an option.

If you release it as open source under the GPL, then Berkeley DB is free for you.
Re: Need suggestion high-level suggestion on how to solve a performance problem
From: Josh Berkus
Date:
Madison,

> The problem comes when the user toggles a directory branch's backup
> flag (a simple check box beside the directory name). If it's a
> directory near the end of a branch it is fast enough. If they toggle
> a single file it is nearly instant. However, if they toggle, say, the
> root directory, so every file and directory below it needs to be
> updated, it can take 500-600 sec to return. Obviously this is no
> good.
>
> What I need is a scheme for being able to say, essentially:
>
> UPDATE file_info_1 SET file_backup='t' WHERE file_parent_dir~'^/';

Well, from the sound of it, the problem is not selecting the files to be updated, it's updating them. What I would do, personally, is *not* store an update flag for each file. Instead, I would store the update flag on the directory which was selected. If users want to exclude certain files and subdirectories, I'd also include a dont_update flag. When it's time to back up, you simply check the tree for the most immediate update or dont_update flag above each file.

For the table itself, I'd consider using ltree for the directory tree structure. It has some nice features which make it significantly better than using a delimited text field.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
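Josh's two-flag variant differs from PFC's three-state scheme in that most rows carry no flag at all; only the directories the user actually clicked are stored. A minimal sketch of the backup-time check, with the flag table and names purely illustrative:

```python
# Sketch of Josh's suggestion: update/dont_update flags live only on
# the directories the user clicked. At backup time, each file takes
# the flag of its nearest flagged ancestor; a file with no flagged
# ancestor is skipped. Toggling any directory, root included, writes
# exactly one entry.

flags = {"/home": "update", "/home/alice/tmp": "dont_update"}

def should_backup(path):
    """The most immediate flag above the file wins."""
    while True:
        flag = flags.get(path)
        if flag is not None:
            return flag == "update"
        if path == "/":
            return False  # no flag anywhere above: not selected
        path = path.rsplit("/", 1)[0] or "/"

print(should_backup("/home/alice/doc.txt"))  # True  (via /home)
print(should_backup("/home/alice/tmp/x.o"))  # False (via /home/alice/tmp)
```

With ltree, the same "nearest flagged ancestor" lookup can be pushed into the database, since ltree can match a path against the set of its prefixes and order the matches by depth.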