Re: Need suggestion high-level suggestion on how to solve a performance problem - Mailing list pgsql-performance

From: PFC
Subject: Re: Need suggestion high-level suggestion on how to solve a performance problem
Msg-id: op.stkdp5zeth1vuj@localhost
In response to: Need suggestion high-level suggestion on how to solve a performance problem (Madison Kelly <linux@alteeve.com>)
List: pgsql-performance
    Hello,
    I once worked at a company that made backup software, and I remember
these problems; we had exactly the same ones!
    The whole file tree was held in memory, and every time the user clicked
on something it had to update everything. Being C++ it was very fast, but
to back up a million files you needed a gig of RAM, which is, let's say, a
problem when you consider that my Linux laptop has about 400k files on it.
    So we rewrote the project entirely, with the goal of handling the
million-file case on a clunky Pentium 90 with 64 megabytes of RAM, and it
worked.
    What I did was this:
    - use Berkeley DB
    Berkeley DB isn't a database like postgres; it's just a tree, but it's
cool for managing trees. It's quite fast, uses key compression, etc.
    It has, however, a few drawbacks:
    - Files tend to fragment a lot over time, and it can't reindex or
vacuum like postgres; you have to dump and reload.
    - The licence to embed it in your product and sell it is expensive,
and if you want it crash-proof, it's insanely expensive.
    - Even though it stores a tree, it has no idea what a parent is, so
you have to manage that manually. We used a clever path encoding to keep
all the paths inside the same directory close together in the tree, and
separate databases for dirs and files, because we wanted the dirs to stay
in the cache whereas we almost never touched the files.

    And...
    You can't make it work if you update every node every time the user
clicks on something. You have to update one node only.
    In your tree you have nodes.
    Give each node a state, one of these three: include, exclude, inherit.
    When you fetch a node you also fetch all of its parents, and you
propagate their states down to determine the state of the final node.
    If a node is in state 'inherit', it behaves like its parent, and so on.
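    A rough sketch of that scheme in modern PostgreSQL, using the ltree
contrib type mentioned further down (the table and column names are my own
illustration, not the original implementation):

    CREATE EXTENSION IF NOT EXISTS ltree;

    -- One row per file or directory, keyed by its path.
    CREATE TABLE nodes (
        path  ltree PRIMARY KEY,   -- e.g. 'var.www.mysite.blah'
        state text  NOT NULL DEFAULT 'inherit'
              CHECK (state IN ('include', 'exclude', 'inherit'))
    );

    -- Including or excluding a whole subtree touches exactly one row:
    UPDATE nodes SET state = 'exclude' WHERE path = 'var.www.mysite';

    -- The effective state of a node is the state of its nearest
    -- ancestor (itself included) that is not 'inherit':
    SELECT state
      FROM nodes
     WHERE path @> 'var.www.mysite.blah.somefile'  -- @> : ancestor-or-self
       AND state <> 'inherit'
     ORDER BY nlevel(path) DESC                    -- deepest setting wins
     LIMIT 1;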
    So you get faster updates but slower selects. However, there is a
bonus: if you mark a directory as "include" and one of its subdirectories
as "exclude", and the user then adds files all over the place, the files
added under the "included" directory will automatically be backed up and
the ones under the "excluded" directory will automatically be ignored;
you have nothing to change.
    And it is not that slow: if you think about it, suppose you have
/var/www/mysite/blah with 20,000 files in it; to inherit the state of the
parents onto them, you only have to fetch /var once, www once, etc.
    So if you propagate the inherited properties while doing a tree
traversal, it comes at essentially no cost.
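    In SQL terms, one pass over the hypothetical nodes table above can
resolve every node's state with a recursive query (again just a sketch):

    -- Each child either keeps its explicit state or inherits the
    -- already-resolved state of its parent, so every row is visited once.
    WITH RECURSIVE eff AS (
        SELECT path, state AS eff_state
          FROM nodes
         WHERE nlevel(path) = 1                    -- the root entries
        UNION ALL
        SELECT n.path,
               CASE WHEN n.state = 'inherit'
                    THEN e.eff_state
                    ELSE n.state END
          FROM nodes n
          JOIN eff e ON subpath(n.path, 0, nlevel(n.path) - 1) = e.path
    )
    SELECT * FROM eff;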

    IMHO it's the only solution.

    It can also be done quite easily in postgres, using the ltree type and
a few small stored procedures; you can even make a view which gives the
state of each element, computed by inheritance.
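    Such a view might look like this (a sketch against the same
hypothetical table, not code from a real implementation):

    -- Each node together with its state as computed by inheritance.
    CREATE VIEW node_effective_state AS
    SELECT n.path,
           (SELECT a.state
              FROM nodes a
             WHERE a.path @> n.path        -- ancestors of n, and n itself
               AND a.state <> 'inherit'
             ORDER BY nlevel(a.path) DESC  -- nearest explicit setting wins
             LIMIT 1) AS state
      FROM nodes n;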

    Here's the secret: the user will select 100,000 files by clicking on a
directory near the root, but the user will NEVER look at 100,000 files. So
you can afford to make looking at files 10x slower if it makes
including/excluding directories 100,000 times faster.

    Now you'll ask me: how do I calculate the total size of the backup
without looking at all the files? When I click on a directory, I don't
know which files are in it, which will inherit, and which will not.

    It's simple: you precompute it when you scan the disk for changed
files. That is the only time you should do a complete tree exploration.

    On each directory we store an [M]x[N] matrix, M and N each being one
of the three states above, containing the amount of data in the directory
that would end up in state M if the directory were in state N. This is
very easy to compute while scanning for new files. Then, when a directory
changes state, you just sum a few cells of that matrix to know how much
that change adds to the backup. And you only look up one record.
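    In SQL that matrix could be stored one cell per row (my own sketch;
the original lived in Berkeley DB, and the state names reuse the earlier
hypothetical schema):

    -- Bytes under each directory that would end up in state child_state
    -- if the directory itself were put in state dir_state.
    CREATE TABLE dir_matrix (
        path        ltree  NOT NULL,
        dir_state   text   NOT NULL,
        child_state text   NOT NULL,
        bytes       bigint NOT NULL,
        PRIMARY KEY (path, dir_state, child_state)
    );

    -- Delta in backup size when a directory flips from 'inherit' to
    -- 'exclude' (both example values): two cells, one indexed lookup.
    SELECT sum(CASE dir_state
                 WHEN 'exclude' THEN  bytes   -- size under the new state
                 WHEN 'inherit' THEN -bytes   -- minus the old state's size
               END) AS delta_bytes
      FROM dir_matrix
     WHERE path = 'var.www.mysite'
       AND child_state = 'include'            -- stuff that gets backed up
       AND dir_state IN ('exclude', 'inherit');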

    Is that helpful?
