Re: Need suggestion high-level suggestion on how to solve a performance problem - Mailing list pgsql-performance
From | PFC |
---|---|
Subject | Re: Need suggestion high-level suggestion on how to solve a performance problem |
Date | |
Msg-id | op.stkdp5zeth1vuj@localhost Whole thread Raw |
In response to | Need suggestion high-level suggestion on how to solve a performance problem (Madison Kelly <linux@alteeve.com>) |
Responses |
Re: Need suggestion high-level suggestion on how to solve
|
List | pgsql-performance |
Hello, I once upon a time worked in a company doing backup software and I remember these problems, we had exactly the same ! The file tree was all into memory and everytime the user clicked on something it haaad to update everything. Being C++ it was very fast, but to backup a million files you needed a gig of RAM, which is... a problem let's say, when you think my linux laptop has about 400k files on it. So we rewrote the project entirely with the purpose of doing the million files thingy with the clunky Pentium 90 with 64 megabytes of RAM, and it worked. What I did was this : - use Berkeley DB Berkeley DB isn't a database like postgres, it's just a tree, but it's cool for managing trees. It's quite fast, uses key compression, etc. It has however a few drawbacks : - files tend to fragment a lot over time and it can't reindex or vacuum like postgres. You have to dump and reload. - the price of the licence to be able to embed it in your product and sell it is expensive, and if you want crash-proof, it's insanely expensive. - Even though it's a tree it has no idea what a parent is so you have to mess with that manually. We used a clever path encoding to keep all the paths inside the same directory close in the tree ; and separated database for dirs and files because we wanted the dirs to be in the cache, whereas we almost never touched the files. And... You can't make it if you update every node everytime the user clicks on something. You have to update 1 node. In your tree you have nodes. Give each node a state being one of these three : include, exclude, inherit When you fetch a node you also fetch all of its parents, and you propagate the state to know the state of the final node. If a node is in state 'inherit' it is like its parent, etc. So you have faster updates but slower selects. However, there is a bonus : if you check a directory as "include" and one of its subdirectory as "exclude", and the user adds files all over the place, the files added in the "included" directory will be automatically backed up and the ones in the 'ignored' directory will be automatically ignored, you have nothing to change. And it is not that slow because, if you think about it, suppose you have /var/www/mysite/blah with 20.000 files in it, in order to inherit the state of the parents on them you only have to fetch /var once, www once, etc. So if you propagate your inherited properties when doing a tree traversal it comes at no cost. IMHO it's the only solution. It can be done quite easily also, using ltree types and a little stored procedures, you can even make a view which gives the state of each element, computed by inheritance. Here's the secret : the user will select 100.000 files by clicking on a directory near root, but the user will NEVER look at 100.000 files. So you can make looking at files 10x slower if you can make including/excluding directories 100.000 times faster. Now you'll ask me, but how do I calculate the total size of the backup without looking at all the files ? when I click on a directory I don't know what files are in it and which will inherit and which will not. It's simple : you precompute it when you scan the disk for changed files. This is the only time you should do a complete tree exploration. On each directory we put a matrix [M]x[N], M and N being one of the three above state, containing the amount of stuff in the directory which would be in state M if the directory was in state N. This is very easy to compute when you scan for new files. Then when a directory changes state, you have to sum a few cells of that matrix to know how much more that adds to the backup. And you only look up 1 record. Is that helpful ?
pgsql-performance by date: