Thread: Backup very large databases

Backup very large databases

From
"Nikolay Mihaylov"
Date:
Hi all,

I have the following problem: when I'm backing up my database (4 GB+ in
a single database), the website that uses it becomes too slow because
of the backup.

The database is 7.0; pg_dump is from 7.2.1.

Any ideas how I can run the backup process at low priority? Or could I
use a second, read-only Postgres server on the same machine?

I'm also interested in joining (or making?) a new group for very large
Postgres databases. Are there other people running similarly large
databases?


-----------------------------------------------------------
The Reboots are for hardware upgrades,
Found more here: http://www.nmmm.nu
Nikolay Mihaylov nmmm@nmmm.nu


Re: Backup very large databases

From
Francisco Reyes
Date:
On Tue, 16 Apr 2002, Nikolay Mihaylov wrote:

> when I'm backing up my database (4 GB+ in a single database), the
> website that uses it becomes too slow because of the backup.
> The database is 7.0; pg_dump is from 7.2.1.

For starters, why haven't you moved the DB to 7.2?

> Any ideas how I can run the backup process at low priority? Or could I
> use a second, read-only Postgres server on the same machine?
Not really sure if you could make it low priority "automagically", but
you could use ps to find the PID for the process and then manually
renice it.
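
As a concrete sketch of that advice (the database name, output path,
and the choice of Python's subprocess module are all my own
illustration, not anything from this thread):

    import subprocess

    # Sketch: start pg_dump at the lowest CPU priority so the website's
    # backends get scheduled ahead of the backup. "mydb" and the output
    # path are made-up examples.
    with open("/var/backups/mydb.dump", "wb") as out:
        subprocess.run(["nice", "-n", "19", "pg_dump", "mydb"],
                       stdout=out, check=True)

    # Or renice a dump that is already running (pgrep does the same
    # job as ps piped through grep):
    pids = subprocess.run(["pgrep", "-f", "pg_dump"],
                          capture_output=True, text=True).stdout.split()
    for pid in pids:
        subprocess.run(["renice", "19", "-p", pid], check=True)

One caveat: nice and renice only lower CPU priority, so if the slowdown
comes mostly from disk I/O contention this will help less than hoped.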


> I'm also interested in joining (or making?) a new group for very large
> Postgres databases. Are there other people running similarly large
> databases?

Don't know how many other people would be interested, but I like the idea.
However, what is wrong with just posting the questions to general?

I have about 7GB worth of data.


Re: Backup very large databases

From
Curt Sampson
Date:
On Fri, 19 Apr 2002, Francisco Reyes wrote:

> > I'm also interested in joining (or making?) a new group for very large
> > Postgres databases. Are there other people running similarly large
> > databases?
>
> Don't know how many other people would be interested, but I like the idea.
> However, what is wrong with just posting the questions to general?

There doesn't seem to be all that much traffic directly concerning large
databases, and most of it seems to be performance-related, so perhaps this
could be subsumed into the performance list someone suggested making.

> I have about 7GB worth of data.

I anticipate having a 30-50 GB or larger database in the next few months.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Backup very large databases

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> On Fri, 19 Apr 2002, Francisco Reyes wrote:
>> I have about 7GB worth of data.

> I anticipate having a 30-50 GB or larger database in the next few months.

FWIW, there are people out there with multi-hundred-GB to TB databases
(you might try searching the archives for "American Chemical Society",
for instance).  So you're not pioneers.

I do agree that the special problems of running huge DBs on Postgres
might belong in a pgsql-perform list rather than any of the regular
lists.  Anyone else agree?  Or does the pgsql-general list readership
want to see this stuff?

            regards, tom lane

Re: Backup very large databases

From
"Roderick A. Anderson"
Date:
On Fri, 19 Apr 2002, Tom Lane wrote:

> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists.  Anyone else agree?  Or does the pgsql-general list readership
> want to see this stuff?

I don't mind it in general but would use a pgsql-perform list if it
existed.


Rod
--
             Why is it so easy to throw caution to the wind.
            Shouldn't it be heavier and shaped like an anvil?

                                                           Jon Anderson


Re: Backup very large databases

From
Steve Lane
Date:
On 4/19/02 10:24 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Curt Sampson <cjs@cynic.net> writes:
>> On Fri, 19 Apr 2002, Francisco Reyes wrote:
>>> I have about 7GB worth of data.
>
>> I anticipate having a 30-50 GB or larger database in the next few months.
>
> FWIW, there are people out there with multi-hundred-GB to TB databases
> (you might try searching the archives for "American Chemical Society",
> for instance).  So you're not pioneers.
>
> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists.  Anyone else agree?  Or does the pgsql-general list readership
> want to see this stuff?


I'd like to see it. It seems to me all of those discussions could have at
least some implication for the "low end".

-- sgl


Re: Backup very large databases

From
Francisco Reyes
Date:
On Fri, 19 Apr 2002, Tom Lane wrote:

> I do agree that the special problems of running huge DBs on Postgres
> might belong in a pgsql-perform list rather than any of the regular
> lists.  Anyone else agree?  Or does the pgsql-general list readership
> want to see this stuff?

I think the better question is: do people who have a special situation
due to their DB size want to go through all of the GENERAL mail?

Even though my DB is only about 7GB, I would like to see a
"performance" list.

What would be its charter?
A list dedicated to large and 24x7 PostgreSQL installations?


Re: Backup very large databases

From
Francisco Reyes
Date:
On Sat, 20 Apr 2002, Curt Sampson wrote:

> I anticipate having a 30-50 GB or larger database in the next few months.

I think that with so much data, having good hardware is key.
What type of data is this? How many people connect concurrently?


Re: Backup very large databases

From
Ron Snyder
Date:
> I think the better question is: do people who have a special situation
> due to their DB size want to go through all of the GENERAL mail?

It seems to me that you are presuming that none of the GENERAL emails would
apply to those who also have large systems, or that those who have large
systems have nothing to learn from the GENERAL list.

I believe that more people will be interested in both topics than will be
interested in only one, and the amount of traffic that applies to only one
group and not the other is not worth the trouble to separate.

I'd like to see it stay on general.

My $.02

-ron


Re: Backup very large databases

From
Curt Sampson
Date:
On Sat, 20 Apr 2002, Ron Snyder wrote:

> > I think the better question is: do people who have a special
> > situation due to their DB size want to go through all of the
> > GENERAL mail?

In my case, no. I try to get through the general mail, but if I'm
short on time, it would be really nice to be able to ignore the day's
postings to general, but still know that I can find some things of
particular interest to me (i.e., stuff I really don't want to miss) on
the performance list.

> It seems to me that you are presuming that none of the GENERAL emails
> would apply to those who also have large systems, or that those who
> have large systems have nothing to learn from the GENERAL list.

No. It's not a matter of what else in general would apply, but of how
one gets at least the most important stuff when one is short of time.

> I believe that more people will be interested in both topics than will
> be interested in only one....

That doesn't really matter, I think. People interested in both
topics can easily subscribe to both lists. The real question, to
my mind, is "are there people who would find it useful to see just
performance-related stuff?" Which I think is true.

Thus, I'd like to see a separate performance list.

Here's my stab at a charter:

    The pgsql-performance list is for technical discussion relating
    to the performance (speed and efficiency) of PostgreSQL. In
    particular, discussions of the performance of common or uncommon
    operations on very large databases are welcome.

    However, "why isn't this query using this index?" questions
    are strongly discouraged, unless you understand the PostgreSQL
    optimizer, are proposing changes to it, and have data to show
    that the change improves the optimizer in a wide variety of
    situations, not just your particular one.

This description could probably be made a lot better, but I think
you get the idea. The reasons for the second paragraph should be
fairly obvious to hacker types, if not to less technical users of
PostgreSQL.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


One particular large database application

From
Curt Sampson
Date:
On Sat, 20 Apr 2002, Francisco Reyes wrote:

> On Sat, 20 Apr 2002, Curt Sampson wrote:
>
> > I anticipate having a 30-50 GB or larger database in the next few months.
>
> I think that with so much data having good hardware is key.
> What type of data is this? How many people concurrently connect?

Well, "I'm not sure." The client is being a bit obscure about what he
really wants, and in fact may not quite know exactly what he really
wants. Welcome to the joys of being a consultant. :-)

My contract is to do what seems to be a proof of concept. They are
giving me a pile of sample data (about half a billion records) which
includes a record ID, date, user ID, and a bit of other stuff. I need to
do selects by user ID across a range of dates. I'm obviously going to
have to work out more details about what they really want as I go along,
which is no problem: that's one of the things I'm hired to do.

Anyway, it appears that this is going to be used as a query-only
copy of a database where the master data is stored elsewhere, so
that's why I'm not too worried about updates. I probably don't need
backups, either, since I can always just re-import the data if I
lose it.

One of the things they want me to try is partitioning the data
across multiple machines, and submitting queries in parallel. So
I'll be writing software that will take a query, figure out what
tables it needs to apply that query to, apply that query to those
tables (chosing the servers appropriately as well), and consolidate
the results.
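
A rough sketch of what such a dispatcher could look like (an
illustration only, not Curt's actual code: the host names, table names,
date-to-server map, and the psycopg2 driver are all assumptions):

    import concurrent.futures
    import psycopg2  # assumed client library; any DB-API driver would do

    # Hypothetical map of date ranges (ISO strings, so plain string
    # comparison orders them correctly) to the server and table holding
    # that slice of the data.
    SHARDS = {
        ("2002-01-01", "2002-01-31"): ("db1.example.com", "records_2002_01"),
        ("2002-02-01", "2002-02-28"): ("db2.example.com", "records_2002_02"),
    }

    def query_shard(host, table, user_id, start, end):
        conn = psycopg2.connect(host=host, dbname="archive")
        try:
            cur = conn.cursor()
            # The table name comes from our own map, never from user input.
            cur.execute(
                f"SELECT * FROM {table} "
                "WHERE user_id = %s AND rec_date BETWEEN %s AND %s",
                (user_id, start, end))
            return cur.fetchall()
        finally:
            conn.close()

    def fan_out(user_id, start, end):
        rows = []
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [
                pool.submit(query_shard, host, table, user_id,
                            max(start, lo), min(end, hi))
                for (lo, hi), (host, table) in SHARDS.items()
                if not (end < lo or start > hi)]  # skip shards outside range
            for f in futures:
                rows.extend(f.result())
        return rows

Anything beyond simple concatenation (ORDER BY, aggregates, LIMIT) has
to be merged again in this layer, which is where the "pretty big trick"
mentioned later in the thread comes in.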

For hardware, it seems that a bunch of cheap, basic PCs would do
the trick. I'm thinking of a system with a 1-2 GHz CPU, 512 MB of
memory, a 20-40 GB IDE disk for the system, log and temporary space,
and an 80 GB or larger IDE disk for the data. If reliability is a
real concern, probably mirroring the disks is the best option.

These systems, in 1U cases, should come in under 200,000 yen each
(under US$1500) and we can scale by adding more of them. But even
the performance of a single one should be fairly good, since I'm
avoiding RAID-5, which tends to be a serious performance killer.

This, BTW, is why I've been going on about import speed recently.
It's expensive (both in terms of cost and performance) to avoid
disk system failure with RAID or whatever, but if you can live with
recovering from failure, rather than avoiding failure, you can use
cheaper and faster hardware. So I'm wondering if I can set this up
to be able to recover from a failure fast enough that I don't need
to worry so much about preventing one.
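
For what it's worth, the recovery path can be little more than
reloading from the master extract with COPY, whose bulk path is far
faster than row-at-a-time INSERTs. Everything below (the psycopg2
driver, the file layout, the table name) is invented for illustration:

    import psycopg2  # assumed client library

    # Sketch: rebuild a lost table from a flat-file extract of the
    # master data. Fast bulk loading is what makes "recover from
    # failure" competitive with preventing failure in hardware.
    conn = psycopg2.connect(dbname="archive")
    cur = conn.cursor()
    cur.execute("TRUNCATE records")
    with open("/data/extract/records.tsv") as f:
        cur.copy_expert("COPY records FROM STDIN", f)
    conn.commit()
    conn.close()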

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: One particular large database application

From
Francisco Reyes
Date:
On Sun, 21 Apr 2002, Curt Sampson wrote:

> One of the things they want me to try is partitioning the data
> across multiple machines, and submitting queries in parallel. So
> I'll be writing software that will take a query, figure out what
> tables it needs to apply that query to, apply that query to those
> tables (chosing the servers appropriately as well), and consolidate
> the results.

Interesting.

> For hardware, it seems that a bunch of cheap, basic PCs would do
> the trick. I'm thinking of a system with a 1-2 GHz CPU, 512 MB of
> memory, a 20-40 GB IDE disk for the system, log and temporary space,
> and an 80 GB or larger IDE disk for the data. If reliability is a
> real concern, probably mirroring the disks is the best option.


May I suggest a different approach?
From what I understand, this data may not change often.
How about, instead of getting numerous cheap machines, getting only 2
or 3 good machines with two 15K RPM drives, 4 GB of RAM, and one IDE
drive for the OS? Or, if you can get even more money, four 15K RPM
drives in RAID 0.


Re: One particular large database application

From
Curt Sampson
Date:
On Sun, 21 Apr 2002, Francisco Reyes wrote:

> May I suggest a different approach?
> From what I understand, this data may not change often.
> How about, instead of getting numerous cheap machines, getting only 2
> or 3 good machines with two 15K RPM drives, 4 GB of RAM, and one IDE
> drive for the OS?

It won't be as cost-effective. Price increases much faster than
performance for an individual component. Take a look at CPUs for
example:

    CPU           Cost    $/GHz
    2.4 GHz P4    $539    $224
    2.0 GHz P4    $322    $161
    1.9 GHz P4    $225    $118
    1.8 GHz P4    $158    $88

For the price of one 2.4 GHz CPU, I can get three 1.8 GHz CPUs,
giving me more than twice the aggregate CPU power, and still have
change left over.

Drives? 73 GB 10K RPM SCSI drives start at $399. I can't even find
a 73 GB 15K RPM drive, but the 36 GB drives are $388 and up. For
$400 I can buy four 7200 RPM IDE drives, and assuming I have them
on separate controllers, I'm going to get much better aggregate
throughput than I could ever get with a single SCSI drive. Not to
mention that I end up with more than four times the storage space
as well.

This is the great thing about distributed systems. The only trick
is distributing your application as well. And that's often a pretty
big trick, otherwise everybody would be doing it.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: One particular large database application

From
Francisco Reyes
Date:
On Mon, 22 Apr 2002, Curt Sampson wrote:

> On Sun, 21 Apr 2002, Francisco Reyes wrote:
> > May I suggest a different approach?
> > From what I understand, this data may not change often.
> > How about, instead of getting numerous cheap machines, getting only
> > 2 or 3 good machines with two 15K RPM drives, 4 GB of RAM, and one
> > IDE drive for the OS?
>
> It won't be as cost-effective. Price increases much faster than
> performance for an individual component. Take a look at CPUs for
> example:
>
>     CPU           Cost    $/GHz
>     2.4 GHz P4    $539    $224
>     2.0 GHz P4    $322    $161
>     1.9 GHz P4    $225    $118
>     1.8 GHz P4    $158    $88

The 1.8 GHz would be fine.

> Drives? 73 GB 10K RPM SCSI drives start at $399. I can't even find
> a 73 GB 15K RPM drive, but the 36 GB drives are $388 and up. For
> $400 I can buy four 7200 RPM IDE drives, and assuming I have them
> on separate controllers, I'm going to get much better aggregate
> throughput than I could ever get with a single SCSI drive. Not to
> mention that I end up with more than four times the storage space
> as well.

If you were to get 4 machines, I would still think 2 machines with 15K
RPM drives would be better than 4 machines with 7200 RPM IDE drives.

IDE drives may have good throughput, but their seek times can't compete
with top-of-the-line SCSI.

> This is the great thing about distributed systems. The only trick
> is distributing your application as well. And that's often a pretty
> big trick, otherwise everybody would be doing it.

Exactly. Since it won't be easy for you to find the best distribution,
it may be worth getting better hardware. :-)


Re: One particular large database application

From
Curt Sampson
Date:
On Mon, 22 Apr 2002, Francisco Reyes wrote:

> IDE drives may have good throughput, but their seek times can't
> compete with top-of-the-line SCSI.

Dollar for dollar, IDE has far better performance. Even including the
cost of extra controllers (because you need one controller for each
drive), I can get four high-end IDE drives for the price of one SCSI
drive. There's no way any SCSI drive is going to do as many I/Os per
second as four good IDE drives.

As well, seek time isn't always important. For your log disk, for
example, you care more about high sequential write speed.

> Exactly. Since it won't be easy for you to find the best distribution,
> it may be worth getting better hardware. :-)

What makes you think it won't be easy, in my case?

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: One particular large database application

From
Michael Loftis
Date:
You're still limited to only a few controllers in a system.  There's
also the issue of concurrency.  IDE drives handle one, and only one,
request at a time.  A SCSI drive can be (and usually is) issued
commands with tags, allowing more than one active/pending command at a
time.  That's why SCSI drives fare better in multi-user environments.
The better drives all have 'smart' cache controllers on them that
re-order the pending commands in a way that optimizes response and
throughput in an intelligent manner (I think Seagate is calling it
serpentine seek, but whatever it is called...).

SCSI allows for many, many more drives in a given system.  SCSI always
has lower CPU overhead for an I/O operation (it is designed as a 'fire
and forget' protocol).

SCSI is more expensive, there's no doubt, but for larger environments,
there are clear benefits.  Also, in a SCSI system a failed drive's
electronics will, in most cases, isolate itself.  In an IDE system a
failed drive will, at the least, make the other drives on that chain
unavailable, and in many cases make any other drives on the same
controller unavailable.  SCSI systems inherently allow for 'hot swap',
whereas with IDE there is no such thing (and if your vendor tells you
there is, he's lying).


Re: One particular large database application

From
Curt Sampson
Date:
On Mon, 22 Apr 2002, Michael Loftis wrote:

> You're still limited to only a few controllers in a system.

Well, I usually can spare two or three PCI slots, so that's six or
eight controllers (and thus drives), at any rate. If I really needed
more drives or storage capacity on a system at that point, it's
probably best to move to an external disk array anyway.

> There's also
> the issue of concurrency.  IDE drives handle one, and only one,
> request at a time.  A SCSI drive can be (and usually is) issued
> commands with tags, allowing more than one active/pending command at
> a time.

Sure. But still, send four requests to a SCSI drive, and four
requests to four IDE drives, and see which comes back faster.

> The better drives
> all have 'smart' cache controllers on them that re-order the pending
> commands in a way that optimizes response and throughput in an
> intelligent manner (I think Seagate is calling it serpentine seek,
> but whatever it is called...).

Many operating systems do this, too, though obviously they may not
be able to do it quite as well as the controller can.

> SCSI is more expensive, there's no doubt, but for larger environments,
> there are clear benefits.

Indeed. I've never argued against this.

I am well aware of the various advantages of SCSI drives. I'm also
aware of what they cost. It's just that there's a particular person
on the list who seems to think a less cost-effective solution would
somehow be better.

> Also, in a SCSI system a failed drive's
> electronics will, in most cases, isolate itself.  In an IDE system a
> failed drive will, at the least, make the other drives on that chain
> unavailable, and in many cases make any other drives on the same
> controller unavailable.

Not a big deal; nobody interested in performance is going to put
more than a single drive on an IDE controller anyway.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: One particular large database application

From
Lincoln Yeoh
Date:
AFAIK Google uses thousands of PCs with IDE drives, so going the
multiple-PC way can work.

Not sure how that would be implemented for PostgreSQL. It seems simple
to support _many_ read-only queries at a time using many PCs, but how
would one speed up a few large parallel queries that way?

Cheerio,
Link.




Re: One particular large database application

From
Curt Sampson
Date:
On Tue, 23 Apr 2002, Lincoln Yeoh wrote:

> Not sure how that would be implemented for PostgreSQL. It seems simple
> to support _many_ read-only queries at a time using many PCs, but how
> would one speed up a few large parallel queries that way?

In my case I'm dealing with data spread across a known range of
dates. So I partition it into separate tables (with identical schema
definitions) based on the date (e.g., a table for January, a table
for February, and so on).
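
To make the naming scheme concrete (a toy illustration only, not Curt's
actual code; the "records" table prefix is hypothetical), mapping a
requested date range onto per-month tables is just arithmetic on
(year, month) pairs:

    from datetime import date

    # Sketch: given a date range, list the monthly tables (all sharing
    # one schema) that a query must be sent to.
    def month_tables(start: date, end: date, prefix: str = "records"):
        tables = []
        y, m = start.year, start.month
        while (y, m) <= (end.year, end.month):
            tables.append("%s_%04d_%02d" % (prefix, y, m))
            m += 1
            if m == 13:
                y, m = y + 1, 1
        return tables

    # month_tables(date(2002, 1, 15), date(2002, 3, 2))
    # -> ['records_2002_01', 'records_2002_02', 'records_2002_03']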

Then when I get a query, I just have to parcel it out to the
appropriate machines and merge the results that I get.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC