Thread: Backup very large databases
Hi all,

I have the following problem: when I'm backing up my database (a single database of 4 GB+), the website that uses it slows down badly because of the backup. The database is 7.0; pg_dump is from 7.2.1.

Any ideas how I can run the backup process at low priority? Or could I run a second, read-only Postgres server on the same machine?

I'm also interested in joining or making a new group for very large Postgres databases. Are there other people running similarly large databases?

-----------------------------------------------------------
The Reboots are for hardware upgrades, Found more here: http://www.nmmm.nu
Nikolay Mihaylov nmmm@nmmm.nu
On Tue, 16 Apr 2002, Nikolay Mihaylov wrote:

> the website that uses it slows down badly because of the backup.
> The database is 7.0; pg_dump is from 7.2.1.

For starters, why haven't you moved the DB to 7.2?

> Any ideas how I can run the backup process at low priority? Or could I
> run a second, read-only Postgres server on the same machine?

Not really sure if you could make it low priority "automagically", but you could use ps to find the PID for the process and then manually do a nice on it.

> I'm also interested in joining or making a new group for very large
> Postgres databases. Are there other people running similarly large
> databases?

Don't know how many other people would be interested, but I like the idea. However, what is wrong with just posting the questions to general? I have about 7GB worth of data.
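For what it's worth, a minimal sketch of the nice/renice approach, assuming a Unix-like shell ("mydb" is a placeholder database name, and the exact ps flags vary by platform):

    # start the dump at low CPU priority in the first place
    nice -n 19 pg_dump mydb > mydb.dump

    # or lower the priority of a dump that is already running
    ps ax | grep pg_dump       # note the pg_dump PID
    renice 19 -p 12345         # 12345 stands in for that PID

Note that nice/renice only affects CPU scheduling; if the slowdown comes from disk I/O contention, this will help less than you might hope.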
On Fri, 19 Apr 2002, Francisco Reyes wrote:

> > I'm also interested in joining or making a new group for very large
> > Postgres databases. Are there other people running similarly large
> > databases?
>
> Don't know how many other people would be interested, but I like the idea.
> However, what is wrong with just posting the questions to general?

There doesn't seem to be all that much traffic directly concerning large databases, and most of it seems to be performance-related, so perhaps this could be subsumed into the performance list someone suggested making.

> I have about 7GB worth of data.

I anticipate having a 30-50 GB or larger database in the next few months.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:

> On Fri, 19 Apr 2002, Francisco Reyes wrote:
>> I have about 7GB worth of data.

> I anticipate having a 30-50 GB or larger database in the next few months.

FWIW, there are people out there with multi-hundred-GB to TB databases (you might try searching the archives for "American Chemical Society", for instance). So you're not pioneers.

I do agree that the special problems of running huge DBs on Postgres might belong in a pgsql-perform list rather than any of the regular lists. Anyone else agree? Or does the pgsql-general list readership want to see this stuff?

regards, tom lane
On Fri, 19 Apr 2002, Tom Lane wrote:

> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists. Anyone else agree? Or does the pgsql-general list readership
> want to see this stuff?

I don't mind it in general but would use a pgsql-perform list if it existed.

Rod
--
Why is it so easy to throw caution to the wind. Shouldn't it be heavier and shaped like an anvil? Jon Anderson
On 4/19/02 10:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Curt Sampson <cjs@cynic.net> writes:
>> On Fri, 19 Apr 2002, Francisco Reyes wrote:
>>> I have about 7GB worth of data.
>
>> I anticipate having a 30-50 GB or larger database in the next few months.
>
> FWIW, there are people out there with multi-hundred-GB to TB databases
> (you might try searching the archives for "American Chemical Society",
> for instance). So you're not pioneers.
>
> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists. Anyone else agree? Or does the pgsql-general list readership
> want to see this stuff?

I'd like to see it. It seems to me all of those discussions could have at least some implication for the "low end".

--
sgl
On Fri, 19 Apr 2002, Tom Lane wrote:

> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists. Anyone else agree? Or does the pgsql-general list readership
> want to see this stuff?

I think the better question is: do people who have a special situation due to their DB size want to go through all of the GENERAL mails?

Even though my DB is only about 7GB I would like to see a "performance" list. What would be its charter? A list dedicated to large and 24 x 7 PostgreSQL installations?
On Sat, 20 Apr 2002, Curt Sampson wrote:

> I anticipate having a 30-50 GB or larger database in the next few months.

I think that with so much data having good hardware is key. What type of data is this? How many people concurrently connect?
> I think the better question is: do people who have a special situation
> due to their DB size want to go through all of the GENERAL mails?

It seems to me that you are presuming that none of the GENERAL emails would apply to those who also have large systems, or that those who have large systems have nothing to learn from the GENERAL list.

I believe that more people will be interested in both topics than will be interested in only one, and the amount of traffic that applies to only one group and not the other is not worth the trouble to separate. I'd like to see it stay on general.

My $.02
-ron
On Sat, 20 Apr 2002, Ron Snyder wrote:

> > I think the better question is: do people who have a special situation
> > due to their DB size want to go through all of the GENERAL mails?

In my case, no. I try to get through the general mail, but if I'm short on time, it would be really nice to be able to ignore the day's postings to general, but still know that I can find some things of particular interest to me (i.e., stuff I really don't want to miss) on the performance list.

> It seems to me that you are presuming that none of the GENERAL emails
> would apply to those who also have large systems, or that those who
> have large systems have nothing to learn from the GENERAL list.

No. It's a matter not of what else in general would apply, but how does one get at least the most important stuff when one is short of time.

> I believe that more people will be interested in both topics than will
> be interested in only one....

That doesn't really matter, I think. People interested in both topics can easily subscribe to both lists. The real question, to my mind, is "are there people who would find it useful to see just performance-related stuff." Which I think is true. Thus, I'd like to see a separate performance list. Here's my stab at a charter:

    The pgsql-performance list is for technical discussion relating
    to the performance (speed and efficiency) of PostgreSQL. In
    particular, discussions of the performance of common or uncommon
    operations on very large databases are welcome.

    However, "why isn't this query using this index?" questions are
    strongly discouraged, unless you understand the PostgreSQL
    optimizer, are proposing changes to it, and have data to show
    that the change improves the optimizer in a wide variety of
    situations, not just your particular one.

This description could probably be made a lot better, but I think you get the idea. The reasons for the second paragraph should be fairly obvious to hacker types, if not less technical users of PostgreSQL.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Sat, 20 Apr 2002, Francisco Reyes wrote:

> On Sat, 20 Apr 2002, Curt Sampson wrote:
>
> > I anticipate having a 30-50 GB or larger database in the next few months.
>
> I think that with so much data having good hardware is key.
> What type of data is this? How many people concurrently connect?

Well, "I'm not sure." The client is being a bit obscure about what he really wants, and in fact may not quite know exactly what he really wants. Welcome to the joys of being a consultant. :-)

My contract is to do what seems to be a proof-of-concept. They are giving me a pile of sample data (about half a billion records) which includes a record ID, date, user ID, and a bit of other stuff. I need to do selects by user ID across a range of dates. I'm obviously going to have to work out more details about what they really want as I go along, which is no problem: that's one of the things I'm hired to do.

Anyway, it appears that this is going to be used as a query-only copy of a database where the master data is stored elsewhere, so that's why I'm not too worried about updates. I probably don't need backups, either, since I can always just re-import the data if I lose it.

One of the things they want me to try is partitioning the data across multiple machines, and submitting queries in parallel. So I'll be writing software that will take a query, figure out what tables it needs to apply that query to, apply that query to those tables (choosing the servers appropriately as well), and consolidate the results. (A rough sketch of that follows below.)

For hardware, it seems that a bunch of cheap, basic PCs would do the trick. I'm thinking of a system with a 1-2 GHz CPU, 512 MB of memory, a 20-40 GB IDE disk for the system, log and temporary space, and an 80 GB or larger IDE disk for the data. If reliability is a real concern, probably mirroring the disks is the best option. These systems, in 1U cases, should come in under 200,000 yen each (under US$1500) and we can scale by adding more of them. But even the performance of a single one should be fairly good, since I'm avoiding RAID-5, which tends to be a serious performance killer.

This, BTW, is why I've been going on about import speed recently. It's expensive (both in terms of cost and performance) to avoid disk system failure with RAID or whatever, but if you can live with recovering from failure, rather than avoiding failure, you can use cheaper and faster hardware. So I'm wondering if I can set this up to be able to recover from a failure fast enough that I don't need to worry so much about preventing one.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
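A rough sketch of that fan-out, in Python. Everything here is an assumption for illustration only: the shard map, the events_YYYY_MM table naming, and the use of psycopg2 as the driver. A real version would also prune shards whose months fall outside the requested date range.

    import concurrent.futures
    import psycopg2  # assumed DB-API driver; any would do

    # Hypothetical month -> server map; each server holds that month's table.
    SHARDS = {
        "2002_01": "host=db1 dbname=events",
        "2002_02": "host=db2 dbname=events",
    }

    def query_shard(dsn, month, user_id, start, end):
        # Each shard holds an identically-shaped table named events_<month>.
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(
                f"SELECT record_id, event_date, user_id "
                f"FROM events_{month} "
                f"WHERE user_id = %s AND event_date BETWEEN %s AND %s",
                (user_id, start, end),
            )
            return cur.fetchall()
        finally:
            conn.close()

    def fan_out(user_id, start, end):
        # Run the same query against every shard in parallel, then
        # concatenate the results (a real merge might need to re-sort).
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [
                pool.submit(query_shard, dsn, month, user_id, start, end)
                for month, dsn in SHARDS.items()
            ]
            return [row for f in futures for row in f.result()]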
On Sun, 21 Apr 2002, Curt Sampson wrote:

> One of the things they want me to try is partitioning the data
> across multiple machines, and submitting queries in parallel. So
> I'll be writing software that will take a query, figure out what
> tables it needs to apply that query to, apply that query to those
> tables (choosing the servers appropriately as well), and consolidate
> the results.

Interesting.

> For hardware, it seems that a bunch of cheap, basic PCs would do
> the trick. I'm thinking of a system with a 1-2 GHz CPU, 512 MB of
> memory, a 20-40 GB IDE disk for the system, log and temporary space,
> and an 80 GB or larger IDE disk for the data. If reliability is a
> real concern, probably mirroring the disks is the best option.

May I suggest a different approach? From what I understand this data may not change often. How about instead of getting numerous cheap machines get only 2 or 3 good machines with 2 15K RPM drives, 4GB of RAM and 1 IDE for the OS. Or if you can get even more money... 4 15K RPM drives on RAID 0.
On Sun, 21 Apr 2002, Francisco Reyes wrote:

> May I suggest a different approach?
> From what I understand this data may not change often.
> How about instead of getting numerous cheap machines get only 2 or 3 good
> machines with 2 15K RPM drives, 4GB of RAM and 1 IDE for the OS.

It won't be as cost-effective. Price increases much faster than performance for an individual component. Take a look at CPUs for example:

    CPU          Cost    $/GHz
    2.4 GHz P4   $539    $224
    2.0 GHz P4   $322    $161
    1.9 GHz P4   $225    $118
    1.8 GHz P4   $158    $88

For the price of one 2.4 GHz CPU, I can get three 1.8 GHz CPUs, giving me more than twice the aggregate CPU power, and still have change left over.

Drives? 73 GB 10K RPM SCSI drives start at $399. I can't even find a 73 GB 15K RPM drive, but the 36 GB drives are $388 and up. For $400 I can buy four 7200 RPM IDE drives, and assuming I have them on separate controllers, I'm going to get much better aggregate throughput than I could ever get with a single SCSI drive. Not to mention that I end up with more than four times the storage space as well.

This is the great thing about distributed systems. The only trick is distributing your application as well. And that's often a pretty big trick, otherwise everybody would be doing it.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Mon, 22 Apr 2002, Curt Sampson wrote:

> On Sun, 21 Apr 2002, Francisco Reyes wrote:
> > May I suggest a different approach?
> > From what I understand this data may not change often.
> > How about instead of getting numerous cheap machines get only 2 or 3 good
> > machines with 2 15K RPM drives, 4GB of RAM and 1 IDE for the OS.
>
> It won't be as cost-effective. Price increases much faster than
> performance for an individual component. Take a look at CPUs for
> example:
>
>     CPU          Cost    $/GHz
>     2.4 GHz P4   $539    $224
>     2.0 GHz P4   $322    $161
>     1.9 GHz P4   $225    $118
>     1.8 GHz P4   $158    $88

The 1.8 GHz would be fine.

> Drives? 73 GB 10K RPM SCSI drives start at $399. I can't even find
> a 73 GB 15K RPM drive, but the 36 GB drives are $388 and up. For
> $400 I can buy four 7200 RPM IDE drives, and assuming I have them
> on separate controllers, I'm going to get much better aggregate
> throughput than I could ever get with a single SCSI drive. Not to
> mention that I end up with more than four times the storage space
> as well.

If you were to get 4 machines I would still think 2 machines with 15K RPM drives would be better than 4 machines with 7200 RPM IDE drives. IDEs may have good throughput, but their seek times can't compete with top-of-the-line SCSI.

> This is the great thing about distributed systems. The only trick
> is distributing your application as well. And that's often a pretty
> big trick, otherwise everybody would be doing it.

Exactly... since it won't be easy for you to find the best distributions then it may be worth getting better hardware. :-)
On Mon, 22 Apr 2002, Francisco Reyes wrote:

> IDEs may have good throughput, but their seek times can't compete with
> top-of-the-line SCSI.

Dollar for dollar, IDE has far better performance. Even including the cost of extra controllers (because you need one controller for each drive), I can get four high-end IDE drives for the price of one SCSI drive. There's no way any SCSI drive is going to do as many I/Os per second as four good IDE drives.

As well, seek time isn't always important. For your log disk, for example, you care more about high sequential write speed.

> Exactly... since it won't be easy for you to find the best distributions
> then it may be worth getting better hardware. :-)

What makes you think it won't be easy, in my case?

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
You're still limited to only a few controllers in a system. There's also the issue of concurrency. IDE drives handle one, and only one, request at a time. A SCSI drive can be (and usually is) issued commands with tags allowing more than one active/pending command at a time. That's why SCSI drives fare better in multi-user environments. The better drives all have 'smart' cache controllers on them that re-order the pending commands in a way that optimizes response and throughput in an intelligent manner (I think Seagate is calling it serpentine seek, but whatever it is called...).

SCSI allows for many, many more drives in a given system. SCSI always has lower CPU overhead for an I/O operation (it is designed as a 'fire and forget' protocol). SCSI is more expensive, there's no doubt, but for larger environments, there are clear benefits.

Also in a SCSI system, a failed drive's electronics will, in most cases, isolate itself. In an IDE system a failed drive will, at the least, make the other drives on that chain unavailable, and in many cases make any other drives on the same controller unavailable. SCSI systems inherently allow for 'hot swap' whereas with IDE there is no such thing (and if your vendor tells you there is -- he's lying).

Curt Sampson wrote:

> On Mon, 22 Apr 2002, Francisco Reyes wrote:
>
> > IDEs may have good throughput, but their seek times can't compete with
> > top-of-the-line SCSI.
>
> Dollar for dollar, IDE has far better performance. Even including the
> cost of extra controllers (because you need one controller for each
> drive), I can get four high-end IDE drives for the price of one SCSI
> drive. There's no way any SCSI drive is going to do as many I/Os per
> second as four good IDE drives.
>
> As well, seek time isn't always important. For your log disk, for
> example, you care more about high sequential write speed.
>
> > Exactly... since it won't be easy for you to find the best distributions
> > then it may be worth getting better hardware. :-)
>
> What makes you think it won't be easy, in my case?
>
> cjs
On Mon, 22 Apr 2002, Michael Loftis wrote:

> You're still limited to only a few controllers in a system.

Well, I usually can spare two or three PCI slots, so that's six or eight controllers (and thus drives), at any rate. If I really needed more drives or storage capacity on a system at that point, it's probably best to move to an external disk array anyway.

> There's also the issue of concurrency. IDE drives handle one, and only
> one, request at a time. A SCSI drive can be (and usually is) issued
> commands with tags allowing more than one active/pending command at a
> time.

Sure. But still, send four requests to a SCSI drive, and four requests to four IDE drives, and see which comes back faster.

> The better drives all have 'smart' cache controllers on them that
> re-order the pending commands in a way that optimizes response and
> throughput in an intelligent manner (I think Seagate is calling it
> serpentine seek, but whatever it is called...)

Many operating systems do this, too, though obviously they may not be able to do it quite as well as the controller can.

> SCSI is more expensive, there's no doubt, but for larger environments,
> there are clear benefits.

Indeed. I've never argued against this. I am well aware of the various advantages of SCSI drives. I'm also aware of what they cost. It's just that there's a particular person on the list who seems to think a less cost-effective solution would somehow be better.

> Also in a SCSI system, a failed drive's electronics will, in most
> cases, isolate itself. In an IDE system a failed drive will, at the
> least, make the other drives on that chain unavailable, and in many
> cases make any other drives on the same controller unavailable.

Not a big deal; nobody interested in performance is going to put more than a single drive on an IDE controller anyway.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
AFAIK Google uses thousands of PCs with IDE drives. So going the multiple-PC way can work.

Not sure how that would be implemented for PostgreSQL, though. It seems simple to support _many_ read-only queries at a time using many PCs. But how would one speed up a few large parallel queries that way?

Cheerio, Link.

At 08:08 AM 4/23/02 +0900, Curt Sampson wrote:

> On Mon, 22 Apr 2002, Francisco Reyes wrote:
>
> > IDEs may have good throughput, but their seek times can't compete with
> > top-of-the-line SCSI.
>
> Dollar for dollar, IDE has far better performance. Even including the
> cost of extra controllers (because you need one controller for each
> drive), I can get four high-end IDE drives for the price of one SCSI
> drive. There's no way any SCSI drive is going to do as many I/Os per
> second as four good IDE drives.
>
> As well, seek time isn't always important. For your log disk, for
> example, you care more about high sequential write speed.
>
> > Exactly... since it won't be easy for you to find the best distributions
> > then it may be worth getting better hardware. :-)
>
> What makes you think it won't be easy, in my case?
>
> cjs
On Tue, 23 Apr 2002, Lincoln Yeoh wrote:

> Not sure how that would be implemented for PostgreSQL, though. It seems
> simple to support _many_ read-only queries at a time using many PCs. But
> how would one speed up a few large parallel queries that way?

In my case I'm dealing with data spread across a known range of dates. So I partition it into separate tables (with identical schema definitions) based on the date (e.g., a table for January, a table for February, and so on). Then when I get a query, I just have to parcel it out to the appropriate machines and merge the results that I get. A sketch of what that layout might look like follows below.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
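A minimal SQL sketch of that layout. The table and column names (events_YYYY_MM, record_id, user_id, event_date) and the user ID 42 are placeholders for illustration:

    -- one table per month, identical schemas, each living on some machine
    CREATE TABLE events_2002_01 (
        record_id  integer,
        user_id    integer,
        event_date date
    );

    CREATE TABLE events_2002_02 (
        record_id  integer,
        user_id    integer,
        event_date date
    );

    -- a by-user, by-date-range query touches only the months it spans;
    -- if both tables happened to live on one server, the merge is just:
    SELECT * FROM events_2002_01
     WHERE user_id = 42 AND event_date BETWEEN '2002-01-15' AND '2002-02-15'
    UNION ALL
    SELECT * FROM events_2002_02
     WHERE user_id = 42 AND event_date BETWEEN '2002-01-15' AND '2002-02-15';

When the monthly tables live on different machines, the dispatching software runs the per-table SELECTs remotely in parallel and concatenates (or re-sorts) the result sets itself.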