Re: Need suggestion high-level suggestion on how to solve a performance problem - Mailing list pgsql-performance

From: Madison Kelly
Subject: Re: Need suggestion high-level suggestion on how to solve a performance problem
Msg-id: 42CDBAB2.9030406@alteeve.com
In response to: Re: Need suggestion high-level suggestion on how to solve a performance problem (PFC <lists@boutiquenumerique.com>)
Responses: Re: Need suggestion high-level suggestion on how to solve a performance problem
List: pgsql-performance
PFC wrote:
>
>     Hello,
>     I once upon a time worked in a company doing backup software and I
> remember these problems, we had exactly the same !

   Pretty neat. :)

>     The file tree was all into memory and everytime the user clicked on
> something it haaad to update everything. Being C++ it was very fast,
> but  to backup a million files you needed a gig of RAM, which is... a
> problem  let's say, when you think my linux laptop has about 400k files
> on it.

   I want this to run on "average" systems (I'm developing it primarily
on my modest P3 1GHz Thinkpad w/ 512MB RAM running Debian), so expecting
that much free memory is not reasonable. As it is, my test DB with a
realistic amount of data is ~150MB.

>     So we rewrote the project entirely with the purpose of doing the
> million  files thingy with the clunky Pentium 90 with 64 megabytes of
> RAM, and it  worked.
>     What I did was this :
>     - use Berkeley DB
<snip>
>     - the price of the licence to be able to embed it in your product
> and  sell it is expensive, and if you want crash-proof, it's insanely
> expensive.

   This is the kicker right there; my program is released under the GPL
so it's fee-free. I can't eat anything costly like that. As it is, there
are hundreds and hundreds of hours in this program that I am already
hoping to recoup one day through support contracts. Adding commercial
software, I'm afraid, is not an option.

> bonus  : if you check a directory as "include" and one of its
> subdirectory as  "exclude", and the user adds files all over the place,
> the files added in  the "included" directory will be automatically
> backed up and the ones in  the 'ignored' directory will be automatically
> ignored, you have nothing to  change.
<snip>
>     IMHO it's the only solution.

   Now *this* is an idea worth looking into. How I will implement it
with my system I don't know yet, but it's a new line of thinking. Wonderful!

>     Now you'll ask me, but how do I calculate the total size of the
> backup  without looking at all the files ? when I click on a directory I
> don't  know what files are in it and which will inherit and which will not.
>
>     It's simple : you precompute it when you scan the disk for changed
> files.  This is the only time you should do a complete tree exploration.

   This is already what I do. When a user selects a partition they want
to select files on to back up or restore, the partition is scanned. The
scan looks at every file, directory and symlink and records its size (on
disk), its mtime, owner, group, etc. to the database. I've got this
scan/update running at ~1,500 files/second on my laptop. That was
actually the first performance tuning I started with. :)
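   In rough pseudocode form (the real program's schema and names aren't
shown here, so everything below is hypothetical), a scan pass like that
looks something like this, using lstat() so symlinks are recorded as
themselves rather than followed:

```python
import os

def scan_partition(root):
    """Walk a mount point and collect per-entry metadata, as the
    scan/update pass described above does before writing each row to
    the database. Illustrative sketch only; field names are made up."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: record the symlink itself
            rows.append({
                "path": path,
                "size_on_disk": st.st_blocks * 512,  # allocated bytes
                "mtime": st.st_mtime,
                "uid": st.st_uid,
                "gid": st.st_gid,
            })
    return rows
```

In practice you'd batch the INSERTs/UPDATEs rather than write row by
row, which is where most of the files-per-second tuning tends to come
from.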

   With all the data in the DB the backup script can calculate rather
intelligently where it wants to copy each directory to.

>     On each directory we put a matrix [M]x[N], M and N being one of the
> three  above state, containing the amount of stuff in the directory
> which would  be in state M if the directory was in state N. This is very
> easy to  compute when you scan for new files. Then when a directory
> changes state,  you have to sum a few cells of that matrix to know how
> much more that adds  to the backup. And you only look up 1 record.
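   To make PFC's matrix concrete (my naming, not his): if each
directory carries a precomputed 3x3 table of "bytes in file-state M if
the directory were in state N", then flipping a directory's state is
just two cell lookups, no tree walk:

```python
# The three states PFC describes for files and directories.
STATES = ("inherit", "backup", "ignore")

def backup_delta(matrix, old_state, new_state):
    """matrix[(M, N)] = bytes whose effective state would be M if the
    directory were in state N. Returns how much the total backup size
    changes when the directory goes old_state -> new_state.
    (Sketch; names and layout are assumptions.)"""
    return matrix[("backup", new_state)] - matrix[("backup", old_state)]

# Example: a directory holding 100 bytes of files, all set to inherit,
# so the files simply follow whatever the directory's state is.
matrix = {(m, n): 0 for m in STATES for n in STATES}
matrix[("backup", "backup")] = 100
matrix[("ignore", "ignore")] = 100
matrix[("backup", "inherit")] = 100  # assuming the parent says "backup"
```

Here `backup_delta(matrix, "backup", "ignore")` comes out to -100: the
whole directory drops out of the backup, known without rescanning it.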

   In my case, what I do is calculate the size of all the files selected
for backup in each directory, sort the directories from all sources by
the total size of their selected files, and then start assigning the
directories, largest to smallest, to each of my available destination
media. If it runs out of destination space, it backs up what it can,
waits a user-definable amount of time, and then checks whether any new
destination media has been made available. If so, it again tries to
assign the files/directories that didn't fit. It will loop a
user-definable number of times before giving up and warning the user
that more destination space is needed for that backup job.
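   The assignment pass above boils down to a first-fit-decreasing bin
packing. A minimal sketch (hypothetical names; the real job also sleeps
and re-polls for new media between passes):

```python
def assign_to_media(dir_sizes, media_free):
    """Largest-first assignment of directories to destination media:
    sort by total selected size, place each directory on the first
    medium with room, and return whatever didn't fit so a later retry
    pass (after new media appears) can try again.
    dir_sizes:  {directory: total selected bytes}
    media_free: list of free bytes per destination medium (mutated)."""
    placements, leftovers = {}, []
    for d, size in sorted(dir_sizes.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(media_free):
            if size <= free:
                placements[d] = i      # medium index this dir lands on
                media_free[i] -= size
                break
        else:
            leftovers.append(d)        # retry when more media shows up
    return placements, leftovers
```

First-fit decreasing isn't optimal packing, but it's simple, fast, and
the leftovers list maps naturally onto the "wait for more media and
retry" loop.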

>     Is that helpful ?

   The three states (inherited, backup, ignore) have definitely caught my
attention. Thank you very much for your idea and lengthy reply!

Madison
