Re: Need suggestion high-level suggestion on how to solve - Mailing list pgsql-performance
From | Madison Kelly |
---|---|
Subject | Re: Need suggestion high-level suggestion on how to solve |
Date | |
Msg-id | 42CDBAB2.9030406@alteeve.com |
In response to | Re: Need suggestion high-level suggestion on how to solve a performance problem (PFC <lists@boutiquenumerique.com>) |
Responses | Re: Need suggestion high-level suggestion on how to solve a performance problem |
List | pgsql-performance |
PFC wrote:
>
> Hello,
> I once upon a time worked in a company doing backup software and I
> remember these problems, we had exactly the same !

Pretty neat. :)

> The file tree was all in memory and every time the user clicked on
> something it had to update everything. Being C++ it was very fast,
> but to back up a million files you needed a gig of RAM, which is... a
> problem let's say, when you think my linux laptop has about 400k files
> on it.

I want this to run on "average" systems (I'm developing it primarily on my modest P3 1GHz Thinkpad w/ 512MB RAM running Debian), so expecting that much free memory is not reasonable. As it is, my test DB, with a realistic amount of data, is ~150MB.

> So we rewrote the project entirely with the purpose of doing the
> million files thingy with the clunky Pentium 90 with 64 megabytes of
> RAM, and it worked.
> What I did was this :
> - use Berkeley DB

<snip>

> - the price of the licence to be able to embed it in your product
> and sell it is expensive, and if you want crash-proof, it's insanely
> expensive.

This is the kicker right there; my program is released under the GPL, so it's fee-free. I can't eat anything costly like that. As it is, there are hundreds and hundreds of hours in this program that I am already hoping to recoup one day through support contracts. Adding commercial software, I am afraid, is not an option.

> bonus : if you check a directory as "include" and one of its
> subdirectories as "exclude", and the user adds files all over the place,
> the files added in the "included" directory will be automatically
> backed up and the ones in the 'ignored' directory will be automatically
> ignored, you have nothing to change.

<snip>

> IMHO it's the only solution.

Now *this* is an idea worth looking into. How I will implement it with my system I don't know yet, but it's a new line of thinking. Wonderful!

> Now you'll ask me, but how do I calculate the total size of the
> backup without looking at all the files ? when I click on a directory I
> don't know what files are in it and which will inherit and which will not.
>
> It's simple : you precompute it when you scan the disk for changed
> files. This is the only time you should do a complete tree exploration.

This is already what I do. When a user selects a partition from which they want to choose files to back up or restore, the partition is scanned. The scan looks at every file, directory and symlink and records its size (on disk), its mtime, owner, group, etc. in the database. I've got this scan/update running at ~1,500 files/second on my laptop. That was actually the first performance tuning I started with. :)

With all the data in the DB, the backup script can calculate rather intelligently where it wants to copy each directory to.

> On each directory we put a matrix [M]x[N], M and N being one of the
> three above states, containing the amount of stuff in the directory
> which would be in state M if the directory was in state N. This is very
> easy to compute when you scan for new files. Then when a directory
> changes state, you have to sum a few cells of that matrix to know how
> much more that adds to the backup. And you only look up 1 record.

In my case, what I do is calculate the size of all the files selected for backup in each directory, sort the directories from all sources by the total size of their selected files, and then start assigning the directories, largest to smallest, to each of my available destination media. If it runs out of destination space, it backs up what it can, waits a user-definable amount of time, and then checks to see if any new destination media has been made available. If so, it again tries to assign the files/directories that didn't fit. It will loop a user-definable number of times before giving up and warning the user that more destination space is needed for that backup job.

> Is that helpful ?

The three states (inherited, backup, ignore) have definitely caught my attention. Thank you very much for your idea and lengthy reply!

Madison
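
For anyone following the thread, here is a minimal sketch of the largest-first assignment described above. It is written in Python purely for illustration. The function name `assign_directories`, the two dictionaries, and the first-fit choice of media are all assumptions, since the message only says directories are assigned largest to smallest. The wait-and-retry loop for new media is left out.

```python
def assign_directories(dir_sizes, media_free):
    """Assign directories to destination media, largest directory first.

    dir_sizes:  {directory path: total bytes of files selected for backup}
    media_free: {media name: free bytes on that destination}
    Returns (assignments, unassigned).
    Assumption: a directory goes to the first media with enough room.
    """
    remaining = dict(media_free)  # copy so the caller's dict is not modified
    assignments = {}
    unassigned = []
    # Sort largest to smallest; ties are broken by path to stay deterministic.
    ordered = sorted(dir_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, size in ordered:
        for media, free in remaining.items():
            if size <= free:
                assignments[path] = media
                remaining[media] = free - size
                break
        else:
            # No media had room; a real job would retry these later.
            unassigned.append(path)
    return assignments, unassigned


if __name__ == "__main__":
    sizes = {"/home": 700, "/etc": 50, "/var": 400, "/srv": 600}
    media = {"disk_a": 1000, "disk_b": 500}
    placed, left_over = assign_directories(sizes, media)
    print(placed)
    print(left_over)
```

Anything returned in `unassigned` corresponds to the directories that "didn't fit" and would be tried again once more destination space shows up.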