Thread: Need high-level suggestion on how to solve a performance problem

Need high-level suggestion on how to solve a performance problem

From
Madison Kelly
Date:
Hi all,

   I hope I am not asking too many questions. :)

   I have been trying to solve a performance problem in my program for a
while now and, after getting an index to work that didn't speed things
up enough, I am stumped. I am hoping someone here has come across
a similar issue and come up with a creative solution they wouldn't mind
sharing.

   I am not looking for details, I expect to do my homework, I just need
a pointer, suggestion or trick.

   The problem I have is that I am using pgSQL as a back end for my
web-based *nix backup program. Part of the database stores info on every
file and directory per partition. I use this information to build my
directory tree. I have a test partition with ~325,000 files of which
~30,000 are directories. I have been able to get the performance up to a
reasonable level for displaying the directory tree including expanding
and contracting branches (~3-5sec). I do this by loading all the
directory info into an array and a hash once and using them as needed
instead of hitting the DB.

   The problem comes when the user toggles a directory branch's backup
flag (a simple check box beside the directory name). If it's a directory
near the end of a branch, it is fast enough. If they toggle a single file,
it is nearly instant. However, if they toggle, say, the root directory, so
that every file and directory below it needs to be updated, it can take
500-600 sec to return. Obviously this is no good.

   What I need is a scheme for being able to say, essentially:

UPDATE file_info_1 SET file_backup='t' WHERE file_parent_dir~'^/';

   ...but faster. An index isn't enough because the update has to touch every matching row anyway.

   I use Perl to access the DB and generate the web pages. The file
browser portion looks and acts like most file browsers (directory tree
in the left frame, with expanding and contracting directory branches, and
a list of files in a given directory on the right). It does not use any
plug-ins like Java, and it is important to me that it stays that way (I
want it to be as simple as possible for the user to install).

   So far the only suggestion I've received is to keep a secondary
'delta' table that stores just the requests. Then, on load, I would fetch
the existing data and check it against the delta table before creating the
page. The biggest drawback for me is that currently I don't need to
provide an 'Apply' button, because a simple JavaScript call passes the
request on to the Perl script immediately. I really like the Mac-esque
approach of keeping the UI as simple and straightforward as possible.
So, a suggestion that doesn't require something like an 'Apply' button
would be much appreciated.

   Thanks for any suggestions in advance!

Madison

PS - For what it's worth, this is the last hurdle for me to overcome
before I can finally release my program as 'beta' after over 15 months
of work! :)

    Hello,
    I once worked at a company doing backup software, and I remember these
problems; we had exactly the same ones!
    The file tree was kept entirely in memory, and every time the user
clicked on something it had to update everything. Being C++ it was very
fast, but to back up a million files you needed a gig of RAM, which is...
a problem, let's say, when you consider that my Linux laptop has about
400k files on it.
    So we rewrote the project entirely, with the goal of handling the
million-file case on a clunky Pentium 90 with 64 megabytes of RAM, and it
worked.
    What I did was this:
    - use Berkeley DB
    Berkeley DB isn't a database like Postgres; it's just a tree, but it's
well suited to managing trees. It's quite fast, uses key compression, etc.
    It has, however, a few drawbacks:
    - Files tend to fragment a lot over time, and it can't reindex or vacuum
like Postgres; you have to dump and reload.
    - The price of the licence to embed it in your product and sell it is
high, and if you want it crash-proof, it's insanely expensive.
    - Even though it's a tree, it has no idea what a parent is, so you have
to manage that yourself. We used a clever path encoding to keep all the
paths inside the same directory close together in the tree, and separate
databases for dirs and files, because we wanted the dirs to stay in the
cache whereas we almost never touched the files.

    And...
    You can't make it fast if you update every node every time the user
clicks on something. You have to update one node.
    In your tree you have nodes.
    Give each node one of three states: include, exclude, inherit.
    When you fetch a node you also fetch all of its parents and propagate
the state down, to know the state of the final node.
    If a node is in state 'inherit', it takes its parent's state, and so on.
    So you have faster updates but slower selects. However, there is a bonus:
if you mark a directory as 'include' and one of its subdirectories as
'exclude', and the user adds files all over the place, the files added
under the 'include' directory will be automatically backed up and the ones
under the 'exclude' directory will be automatically ignored; you have
nothing to change.
    And it is not that slow because, if you think about it, suppose you have
/var/www/mysite/blah with 20,000 files in it: to inherit the state of the
parents you only have to fetch /var once, www once, etc.
    So if you propagate the inherited properties while doing a tree
traversal, it comes at no cost.
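    The scheme above can be sketched in a few lines; this is a toy Python
illustration (the original program uses Perl and pgSQL, and `Node` and
`effective_state` are invented names):

```python
# Each node stores only its own flag: 'include', 'exclude', or 'inherit'.
# Toggling an entire subtree is then a single-node write; the effective
# state of any node is resolved by walking up to the nearest concrete flag.

class Node:
    def __init__(self, name, parent=None, state="inherit"):
        self.name = name
        self.parent = parent
        self.state = state  # 'include', 'exclude', or 'inherit'

def effective_state(node, default="exclude"):
    """Walk toward the root until a non-'inherit' flag is found."""
    while node is not None:
        if node.state != "inherit":
            return node.state
        node = node.parent
    return default

# Toggling '/' is one write, no matter how large the subtree is:
root = Node("/", state="include")
var = Node("var", parent=root)                  # inherits 'include'
log = Node("log", parent=var, state="exclude")  # local override
msgs = Node("messages", parent=log)             # inherits 'exclude'
```

    During a traversal you would pass the inherited state down instead of
re-walking the parents for every node, which is what keeps the selects cheap.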

    IMHO it's the only solution.

    It can also be done quite easily in PostgreSQL: using the ltree type
and a few stored procedures, you can even make a view that gives the state
of each element, computed by inheritance.

    Here's the secret: the user will select 100,000 files by clicking on a
directory near the root, but the user will NEVER look at 100,000 files. So
you can afford to make looking at files 10x slower if it makes
including/excluding directories 100,000 times faster.

    Now you'll ask me: how do I calculate the total size of the backup
without looking at all the files? When I click on a directory, I don't
know which files are in it, which will inherit, and which will not.

    It's simple: you precompute it when you scan the disk for changed files.
This is the only time you should do a complete tree exploration.

    On each directory we put an M x N matrix, with M and N each being one of
the three states above, containing the amount of data in the directory
that would be in state M if the directory were in state N. This is very
easy to compute when you scan for new files. Then, when a directory
changes state, you just sum a few cells of that matrix to know how much
that adds to the backup. And you only look up one record.
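    A rough Python sketch of that precompute (the dict layout and the
function names are invented; the real totals would be stored per directory
row by the disk-scan pass):

```python
# At scan time, each directory caches the total bytes that would be backed
# up for each state it could effectively be in.  After a click, the new
# backup size is a single lookup instead of a subtree walk.

def selected_bytes(node, inherited):
    """Bytes backed up under `node`, given the state inherited from above."""
    state = node["state"] if node["state"] != "inherit" else inherited
    if node["type"] == "file":
        return node["size"] if state == "include" else 0
    return sum(selected_bytes(child, state) for child in node["children"])

def precompute_totals(directory):
    """One pass per disk scan: a total for each state the dir could take."""
    directory["totals"] = {
        s: sum(selected_bytes(c, s) for c in directory["children"])
        for s in ("include", "exclude")
    }
```

    Files flagged 'inherit' follow the directory's state, so the totals can
be built bottom-up in the same pass that records sizes and mtimes.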

    Is that helpful?



Re: Need high-level suggestion on how to solve a performance problem

From
Madison Kelly
Date:
PFC wrote:
>
>     Hello,
>     I once upon a time worked in a company doing backup software and I
> remember these problems, we had exactly the same !

   Pretty neat. :)

>     The file tree was all into memory and everytime the user clicked on
> something it haaad to update everything. Being C++ it was very fast,
> but  to backup a million files you needed a gig of RAM, which is... a
> problem  let's say, when you think my linux laptop has about 400k files
> on it.

   I want this to run on "average" systems (I'm developing it primarily
on my modest P3 1GHz ThinkPad w/ 512MB RAM running Debian), so expecting
that much free memory is not reasonable. As it is, my test DB, with a
realistic amount of data, is ~150MB.

>     So we rewrote the project entirely with the purpose of doing the
> million  files thingy with the clunky Pentium 90 with 64 megabytes of
> RAM, and it  worked.
>     What I did was this :
>     - use Berkeley DB
<snip>
>     - the price of the licence to be able to embed it in your product
> and  sell it is expensive, and if you want crash-proof, it's insanely
> expensive.

   This is the kicker right there; my program is released under the GPL,
so it's fee-free. I can't eat a cost like that. As it is, there are
hundreds and hundreds of hours in this program that I am hoping to recoup
one day through support contracts. Adding commercial software, I am
afraid, is not an option.

> bonus  : if you check a directory as "include" and one of its
> subdirectory as  "exclude", and the user adds files all over the place,
> the files added in  the "included" directory will be automatically
> backed up and the ones in  the 'ignored' directory will be automatically
> ignored, you have nothing to  change.
<snip>
>     IMHO it's the only solution.

   Now *this* is an idea worth looking into. How I will implement it in
my system I don't know yet, but it's a new line of thinking. Wonderful!

>     Now you'll ask me, but how do I calculate the total size of the
> backup  without looking at all the files ? when I click on a directory I
> don't  know what files are in it and which will inherit and which will not.
>
>     It's simple : you precompute it when you scan the disk for changed
> files.  This is the only time you should do a complete tree exploration.

   This is already what I do. When a user selects a partition they want
to back up or restore, the partition is scanned. The scan looks at every
file, directory, and symlink and records its size (on disk), its mtime,
owner, group, etc. to the database. I've got this scan/update running at
~1,500 files/second on my laptop. That was actually the first performance
tuning I started with. :)

   With all the data in the DB the backup script can calculate rather
intelligently where it wants to copy each directory to.

>     On each directory we put a matrix [M]x[N], M and N being one of the
> three  above state, containing the amount of stuff in the directory
> which would  be in state M if the directory was in state N. This is very
> easy to  compute when you scan for new files. Then when a directory
> changes state,  you have to sum a few cells of that matrix to know how
> much more that adds  to the backup. And you only look up 1 record.

   In my case, I calculate the size of all the files selected for backup
in each directory, sort the directories from all sources by the total
size of their selected files, and then assign the directories, largest to
smallest, to each of my available destination media. If it runs out of
destination space, it backs up what it can, waits a user-definable amount
of time, and then checks whether any new destination media has been made
available. If so, it again tries to assign the files/directories that
didn't fit. It will loop a user-definable number of times before giving
up and warning the user that more destination space is needed for that
backup job.

>     Is that helpful ?

   The three states (inherit, backup, ignore) have definitely caught my
attention. Thank you very much for your idea and lengthy reply!

Madison


>    This is the kicker right there; my program is released under the GPL
> so it's fee-free. I can't eat anything costly like that. As it is there
> is hundreds and hundreds of hours in this program that I am already
> hoping to recoup one day through support contracts. Adding commercial
> software I am afraid is not an option.

    If you release it as open source under the GPL, then Berkeley DB is
free for you.

Madison,

>    The problem comes when the user toggles a directory branch's backup
> flag (a simple check box beside the directory name). If it's a directory
> near the end of a branch it is fast enough. If they toggle a single file
> it is nearly instant. However if they toggle say the root directory, so
> every file and directory below it needs to be updated, it can take
> 500-600sec to return. Obviously this is no good.
>
>    What I need is a scheme for being able to say, essentially:
>
> UPDATE file_info_1 SET file_backup='t' WHERE file_parent_dir~'^/';

Well, from the sound of it, the problem is not selecting the files to be
updated; it's updating them.

What I would do, personally, is *not* store an update flag for each file.
Instead, I would store the update flag on the directory that was
selected. If users want to exclude certain files or subdirectories, I'd
also include a dont_update flag. When it's time to back up, you simply
check the tree for the most immediate update or dont_update flag above
each file.
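A toy Python version of that nearest-flag lookup (a dict scan stands in
for what would really be an indexed SQL query; only `dont_update` comes
from the text above, the other names are invented):

```python
# Flags are stored only for the paths the user actually clicked; a file's
# backup status comes from its longest flagged ancestor path.

flags = {}  # path -> True (update) or False (dont_update)

def set_flag(path, update):
    flags[path] = update  # one row written per click, however big the subtree

def should_backup(path, default=False):
    """Resolve a file's status from its most immediate flagged ancestor."""
    best_len, status = -1, default
    for prefix, update in flags.items():
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if len(prefix) > best_len:
                best_len, status = len(prefix), update
    return status

set_flag("/home", True)               # back up all of /home ...
set_flag("/home/madison/tmp", False)  # ... except this subtree
```

Toggling the root directory stays a one-row write no matter how many
files sit below it; the cost moves to the (much rarer) status lookups.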

For the table itself, I'd consider using ltree for the directory tree
structure. It has some nice features that make it significantly better
than using a delimited text field.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco