Thread: large database

large database

From
"Mihai Popa"
Date:
Hi,

I've recently inherited a project that involves importing a large set of
Access mdb files into a Postgres or MySQL database.
The process is to export the mdb files to comma-separated files, then import
those into the final database.
We are now at the point where the csv files are all created and amount
to some 300 GB of data.
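
For the import step I expect something along the lines of Postgres's COPY
(or MySQL's LOAD DATA INFILE); a rough sketch for the Postgres side, with
hypothetical table and file names:

-- Rough sketch of the import step (hypothetical table and file name;
-- one such COPY per exported table).
CREATE TABLE sales_2012 (
    id      bigint,
    sold_on date,
    amount  numeric(12,2)
);

COPY sales_2012 FROM '/data/csv/sales_2012.csv' WITH (FORMAT csv, HEADER true);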

I would like to get some advice on the best deployment option.

First, the project has been started using MySQL. Is it worth switching
to Postgres and if so, which version should I use?

Second, where should I deploy it? The cloud or a dedicated box?

Amazon seems like the sensible choice; you can scale it up and down as
needed and backup is handled automatically.
I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of
storage. Would this do, or will I end up with a larger/more expensive
instance?

Alternatively, I looked at a Dell server with 32 GB of RAM and some
really good hard drives. But such a box does not come cheap and I don't
want to be stuck with the hardware if it doesn't cut it.





thank you,

--
Mihai Popa <mihai@lattica.com>
Lattica, Inc.



Re: large database

From
Ondrej Ivanič
Date:
Hi,

On 11 December 2012 07:26, Mihai Popa <mihai@lattica.com> wrote:
> First, the project has been started using MySQL. Is it worth switching
> to Postgres and if so, which version should I use?

You should consider several things:
- do you have in-depth MySQL knowledge in your team?
- do you need any sql_mode "features"?
(http://dev.mysql.com/doc/refman/5.6/en/server-sql-mode.html)
- do you need flexible, easy to setup and monitor replication?
- do you need multi-master?
- do you need REPLACE / INSERT ... ON DUPLICATE KEY UPDATE / INSERT
IGNORE syntax?
- do you need many connections to your database without deploying/using a
load balancer?
- do you need something which is MySQL only?
(http://dev.mysql.com/doc/refman/5.0/en/compatibility.html)

If you answered 'yes' four or more times, then I would stick with MySQL...
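
(For context on the REPLACE / ON DUPLICATE KEY point: Postgres has no native
upsert statement, so the usual workaround is a data-modifying CTE. A rough
sketch, assuming a hypothetical table t(id, val); note it is not safe against
concurrent writers without extra locking:)

-- MySQL: INSERT INTO t (id, val) VALUES (1, 'x')
--          ON DUPLICATE KEY UPDATE val = VALUES(val);

-- Rough Postgres equivalent (9.1+): try the UPDATE first,
-- then INSERT only if no row was updated.
WITH upsert AS (
    UPDATE t SET val = 'x' WHERE id = 1
    RETURNING id
)
INSERT INTO t (id, val)
SELECT 1, 'x'
WHERE NOT EXISTS (SELECT 1 FROM upsert);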

--
Ondrej Ivanic
(http://www.linkedin.com/in/ondrejivanic)


Re: large database

From
Jan Kesten
Date:
Hi Mihai.

> We are now at the point where the csv files are all created and amount
> to some 300 GB of data.

> I would like to get some advice on the best deployment option.

First - and maybe best - advice: Do some testing on your own and plan
some time for this.

> First, the project has been started using MySQL. Is it worth switching
> to Postgres and if so, which version should I use?

When switching to PostgreSQL I would recommend using the latest stable
version. But your project is already running on MySQL - are there issues
you expect to solve by switching to another database system? If not:
why switch?

> Second, where should I deploy it? The cloud or a dedicated box?

Given 1 TB of storage, the x-large instance and 10000 provisioned IOPS
would mean about 2000 USD for a 100% utilized instance on Amazon. This is
not really ultra-cheap ;-) For two months of that you can get a
dedicated Dell server with eight drives, buy two extra SSDs and have full
control of the box. But things get much cheaper if real IOPS are not
needed at such a high rate.

Also, if you use cloud infrastructure but need your data on a local
system, keep network latency in mind.

We have several huge PostgreSQL databases running and have used
OpenIndiana with ZFS and SSDs for data storage for quite a while now; it
works perfectly.

There are some slides from Sun/Oracle about ZFS, ZIL, SSD and PostgreSQL
performance (I can see if I can find them if needed).

> Alternatively I looked at a Dell server with 32 GB of RAM and some
> really good hard drives. But such a box does not come cheap and I don't
> want to keep the pieces if it doesn't cut it

Just a hint: do not simply look at Dell's prices - phone them and get a
quote. I was surprised (but do not buy SSDs there).

Think about how your data is structured and how it will be queried once it
has been imported into the database, to see where your bottlenecks are.

Cheers,
Jan


Re: large database

From
Johannes Lochmann
Date:
Hello Jan, hello List

On 12/11/2012 09:10 AM, Jan Kesten wrote:
> There are some slides from Sun/Oracle about ZFS, ZIL, SSD and
> PostgreSQL performance (I can see if I can find them if needed).

I would very much appreciate a copy or a link to these slides!

Johannes


Re: large database

From
Jan Kesten
Date:
Hi all,

> I would very much appreciate a copy or a link to these slides!

here they are:

http://www.scribd.com/mobile/doc/61186429

Have fun!



Re: large database

From
Chris Angelico
Date:
On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote:
> Second, where should I deploy it? The cloud or a dedicated box?

Forget cloud. For similar money, you can get dedicated hosting with
much more reliable performance. We've been looking at places to deploy
a new service, and to that end, we booked a few cloud instances and
started playing. Bang for buck, even the lower-end dedicated servers
(eg about $35/month) majorly outdo Amazon cloud instances.

But don't take someone's word for it. Amazon lets you trial their
system for a year, up to (I think) ~750 computation hours per month
of their basic instance type. You can find out for yourself exactly
how unsuitable it is! :)

The fact is that cloud platforms offer flexibility, and that
flexibility comes at a significant cost. I don't think PostgreSQL can
adequately exploit X nodes with 600MB RAM each, while it _can_ make
excellent use of a single computer with gobs (that's a new SI unit,
you know) of memory.

Incidentally, I've heard tell that cloud instances can vary enormously
in performance through the day or week, but we did some cursory
testing and didn't experience that. That doesn't prove you won't have
problems, of course, but it's one of the purported downsides of
clouding that clearly isn't as universal as I've heard said.

ChrisA


Re: large database

From
Gavin Flower
Date:
On 11/12/12 23:25, Chris Angelico wrote:
> On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote:
>> Second, where should I deploy it? The cloud or a dedicated box?
>
> Forget cloud. For similar money, you can get dedicated hosting with
> much more reliable performance. We've been looking at places to deploy
> a new service, and to that end, we booked a few cloud instances and
> started playing. Bang for buck, even the lower-end dedicated servers
> (eg about $35/month) majorly outdo Amazon cloud instances.
>
> But don't take someone's word for it. Amazon lets you trial their
> system for a year, up to (I think) ~750 computation hours per month
> of their basic instance type. You can find out for yourself exactly
> how unsuitable it is! :)
>
> The fact is that cloud platforms offer flexibility, and that
> flexibility comes at a significant cost. I don't think PostgreSQL can
> adequately exploit X nodes with 600MB RAM each, while it _can_ make
> excellent use of a single computer with gobs (that's a new SI unit,
> you know) of memory.
>
> Incidentally, I've heard tell that cloud instances can vary enormously
> in performance through the day or week, but we did some cursory
> testing and didn't experience that. That doesn't prove you won't have
> problems, of course, but it's one of the purported downsides of
> clouding that clearly isn't as universal as I've heard said.
>
> ChrisA


Would you say the issue is cloudy?
(I'm not being entirely facetious!)


Cheers,
Gavin

Re: large database

From
Chris Travers
Date:


On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote:
> Hi,
>
> I've recently inherited a project that involves importing a large set of
> Access mdb files into a Postgres or MySQL database.
> The process is to export the mdb files to comma-separated files, then import
> those into the final database.
> We are now at the point where the csv files are all created and amount
> to some 300 GB of data.

OK, this doesn't sound like anything that would require MySQL-only features.

> I would like to get some advice on the best deployment option.
>
> First, the project has been started using MySQL. Is it worth switching
> to Postgres and if so, which version should I use?

Probably. I say probably because there are costs; see below.

> Second, where should I deploy it? The cloud or a dedicated box?

The cloud is best for low-end databases; once you have high disk I/O it
breaks down, and a large database suggests significant disk I/O.

> Amazon seems like the sensible choice; you can scale it up and down as
> needed and backup is handled automatically.

Keep in mind there is a huge complexity cost there when it comes to
maintaining adequate performance.
 
> I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of
> storage. Would this do, or will I end up with a larger/more expensive
> instance?
>
> Alternatively I looked at a Dell server with 32 GB of RAM and some
> really good hard drives. But such a box does not come cheap and I don't
> want to be stuck with the hardware if it doesn't cut it.

Without knowing your expected usage pattern it is hard to estimate the memory
needed. I would, however, suggest that the first thing to do is to estimate
the frequently used set of data under normal operations, and then multiply
that by a reasonable factor to give you reasonable assurance that the data
you want will be in cache.
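
One crude way to gauge that working set, once some representative load is
running, is Postgres's block-hit statistics (a sketch; treat it only as a
rough indicator, since the OS page cache is not visible here):

-- Rough cache hit ratio per database: how often requested blocks were
-- already in shared_buffers vs. read from disk (or the OS cache).
SELECT datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
FROM pg_stat_database
ORDER BY blks_hit + blks_read DESC;

If the hit percentage stays low while shared_buffers is already generous,
that is a sign the working set is bigger than the memory you have.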

Second, look at the expected number of parallel queries and plan the number of
processors around that.

As for costs of switching there are some other things you need to be aware of:

1: MySQL and PostgreSQL have completely different assumptions about standard table usage, and these impact indexing strategies and bulk data performance. You can expect to do some relearning regarding performance tuning and indexing (you typically create a lot more indexes in MySQL than in PostgreSQL, which in turn tends to make bulk inserts a bit faster on PostgreSQL).

2: MySQL and PostgreSQL have completely different assumptions about what you are doing with your data. MySQL assumes it is a single-app database; PostgreSQL assumes it is a multi-app database. This explains a lot of the sql_mode weirdness in MySQL and why the PostgreSQL team will never go down that path. It also means you typically have more mature tools for manipulating data inside the database (via views, functions, and the like) than you do in MySQL, while MySQL has some additional flexibility if you are just porting to it from another database.

If it were me I would definitely migrate. It sounds like these are internal line-of-business databases, and they may come from different MS Access apps. For this reason, I think you will have more options for centralizing your code and managing your front-end access (I assume through MS Access?) in PostgreSQL than you would in MySQL. However, the choice is yours.

Best Wishes,
Chris Travers

Re: large database

From
Johannes Lochmann
Date:
Hi all,

On 12/11/2012 11:02 AM, Jan Kesten wrote:
>> I would very much appreciate a copy or a link to these slides!
> here they are:
>
> http://www.scribd.com/mobile/doc/61186429
>
thank you very much!

Johannes


Re: large database

From
Chris Angelico
Date:
On Tue, Dec 11, 2012 at 9:33 PM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
>
> On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote:
> > Second, where should I deploy it? The cloud or a dedicated box?
>
> Would you say the issue is cloudy?
> (I'm not being entirely facetious!)

*Groan* :)

It's certainly not clear-cut in the general case. In our specific
case, cloud was definitely not the way to go (we have other systems in
place for handling scale-out, so it's better for us to simply get X
dedicated computers and have an administrative decision to scale up to
Y, rather than automate it up and down as cloud can do).

ChrisA


Re: large database

From
Bill Moran
Date:
On Mon, 10 Dec 2012 15:26:02 -0500 (EST) "Mihai Popa" <mihai@lattica.com> wrote:

> Hi,
>
> I've recently inherited a project that involves importing a large set of
> Access mdb files into a Postgres or MySQL database.
> The process is to export the mdb's to comma separated files than import
> those into the final database.
> We are now at the point where the csv files are all created and amount
> to some 300 GB of data.
>
> I would like to get some advice on the best deployment option.
>
> First, the project has been started using MySQL. Is it worth switching
> to Postgres and if so, which version should I use?

I've been managing a few large databases this year, on both PostgreSQL and
MySQL.

Don't put your data in MySQL.  Ever.  If you feel like you need to use
something like MySQL, just go straight to a system that was designed with
no constraints right off the bat, like Mongo or something.

Don't put large amounts of data in MySQL.  There are lots of issues with it.
Despite the fact that lots of people have been able to make it work (me,
for example) it's a LOT harder to keep running well than it is on
PostgreSQL.  MySQL just isn't designed to deal with large data.  As some
examples: lack of CREATE INDEX CONCURRENTLY, the fact that the default
configuration stores everything in a single file, the fact that any table
changes (including simple things like adding a comment, or seemingly
unrelated things like adding an index) require a complete table rebuild,
and the fact that if you use anything other than INT AUTO_INCREMENT for
your primary key you're liable to hit on awful inefficiencies.

PostgreSQL has none of these problems.
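
(For reference, the concurrent index build is a one-liner on the Postgres
side; a sketch with hypothetical table and column names:)

-- Build the index without holding a long exclusive lock on the table;
-- reads and writes continue while the index is being created.
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);

The table stays available for both reads and writes for the duration, which
matters a lot once tables reach the hundreds-of-gigabytes range.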

--
Bill Moran <wmoran@potentialtech.com>


Re: large database

From
Tony CL Chan
Date:
Hi, 

If you have big tables you could also think about Hadoop/HBase or Cassandra, but do not put a large data set in MySQL. I agree with Bill that "despite the fact that lots of people have been able to make it (MySQL) work" (me too, another example), there are issues with it. I have been using MySQL for a number of years to handle large DBs with a large number of users, and MySQL is the bottleneck: CPU and I/O load go up, especially when running table joins over a large data set.

If switching to PostgreSQL, PostgreSQL 9.1.x is a very good choice for production deployment.

Thanks
Tony


P.S. Today I did some stress tests on my PostgreSQL staging server: a) insert 2 billion records into a test table, b) full-scan the table. Here are some test results:

Facts:
Number of records: 2 billion records inserted today
Full table scan: about 16.76 minutes to scan 2 billion rows, really AMAZING!
Database size: 109GB
PostgreSQL: 9.2.1
Physical RAM: 8GB
CPU: i5


########

EXPLAIN ANALYZE SELECT COUNT(*) FROM test;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33849559.60..33849559.61 rows=1 width=0) (actual time=1006476.308..1006476.309 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..28849559.28 rows=2000000128 width=0) (actual time=47.147..903264.427 rows=2000000000 loops=1)
 Total runtime: 1006507.963 ms
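
(For anyone who wants to try something comparable: a rough sketch of how such
a test table can be generated - hypothetical single-column layout, the real
test table may look different; repeat or widen the range to reach 2 billion
rows.)

-- Minimal test table, bulk-loaded with generate_series() in batches.
CREATE TABLE test (id bigint);

INSERT INTO test
SELECT g FROM generate_series(1, 100000000) AS g;   -- 100 million rows per batch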





On 11 Dec 2012, at 8:27 PM, Bill Moran wrote:

> On Mon, 10 Dec 2012 15:26:02 -0500 (EST) "Mihai Popa" <mihai@lattica.com> wrote:
>
>> Hi,
>>
>> I've recently inherited a project that involves importing a large set of
>> Access mdb files into a Postgres or MySQL database.
>> The process is to export the mdb files to comma-separated files, then import
>> those into the final database.
>> We are now at the point where the csv files are all created and amount
>> to some 300 GB of data.
>>
>> I would like to get some advice on the best deployment option.
>>
>> First, the project has been started using MySQL. Is it worth switching
>> to Postgres and if so, which version should I use?
>
> I've been managing a few large databases this year, on both PostgreSQL and
> MySQL.
>
> Don't put your data in MySQL.  Ever.  If you feel like you need to use
> something like MySQL, just go straight to a system that was designed with
> no constraints right off the bat, like Mongo or something.
>
> Don't put large amounts of data in MySQL.  There are lots of issues with it.
> Despite the fact that lots of people have been able to make it work (me,
> for example) it's a LOT harder to keep running well than it is on
> PostgreSQL.  MySQL just isn't designed to deal with large data.  As some
> examples: lack of CREATE INDEX CONCURRENTLY, the fact that the default
> configuration stores everything in a single file, the fact that any table
> changes (including simple things like adding a comment, or seemingly
> unrelated things like adding an index) require a complete table rebuild,
> and the fact that if you use anything other than INT AUTO_INCREMENT for
> your primary key you're liable to hit on awful inefficiencies.
>
> PostgreSQL has none of these problems.
>
> --
> Bill Moran <wmoran@potentialtech.com>



Re: large database

From
David Boreham
Date:
On 12/10/2012 1:26 PM, Mihai Popa wrote:
>
> Second, where should I deploy it? The cloud or a dedicated box?
>
> Amazon seems like the sensible choice; you can scale it up and down as
> needed and backup is handled automatically.
> I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of
> storage. Would this do, or will I end up with a larger/ more expensive
> instance?
>
> Alternatively I looked at a Dell server with 32 GB of RAM and some
> really good hard drives. But such a box does not come cheap and I don't
> want to keep the pieces if it doesn't cut it
>

Note that it will be much cheaper to buy a machine from the likes of
Newegg or Amazon as parts and put it together yourself, if you have the
time to spare and don't care about a Dell warranty. We've been deploying
64G hex-core Xeon Ivy Bridge boxes with 300G SSD for about $2500.
Another 300G SSD for your database size would add about $1200. The newer
Intel DC-series SSDs should be cheaper, but are not yet available. So, as
someone else pointed out, you could buy a very capable box outright for
the cost of a few months' Amazon fees.

I'm not sure I'd worry too much about how to do backups (Amazon just
copies the data the same as you can, to an external drive) but if you
need things like spare machines, pay a human to manage them and so on,
then there are cost benefits to the cloud approach. It also allows you
to blame someone else if something goes wrong.





Re: large database

From
Mihai Popa
Date:
On 12/11/2012 07:27 AM, Bill Moran wrote:
> On Mon, 10 Dec 2012 15:26:02 -0500 (EST) "Mihai Popa" <mihai@lattica.com> wrote:
>
>> Hi,
>>
>> I've recently inherited a project that involves importing a large set of
>> Access mdb files into a Postgres or MySQL database.
>> The process is to export the mdb's to comma separated files than import
>> those into the final database.
>> We are now at the point where the csv files are all created and amount
>> to some 300 GB of data.
>>
>> I would like to get some advice on the best deployment option.
>>
>> First, the project has been started using MySQL. Is it worth switching
>> to Postgres and if so, which version should I use?
> I've been managing a few large databases this year, on both PostgreSQL and
> MySQL.
>
> Don't put your data in MySQL.  Ever.  If you feel like you need to use
> something like MySQL, just go straight to a system that was designed with
> no constraints right off the bat, like Mongo or something.

I've never worked with MySQL before; I did work with Postgres a lot over
the last few years, but never with such large databases, so I cannot really
choose one over the other; hence my posting :)

> and the fact that if you use anything other than INT AUTO_INCREMENT for
> your primary key you're liable to hit on awful inefficiencies.

Unfortunately, I don't know much yet about the usage pattern; all I know
is that the data is mostly read-only. There will be a few updates every
year, but they will probably happen as batch jobs overnight.
Meanwhile it appears there is a lot more of it: 800 GB rather than
300 as initially thought.
There aren't a lot of tables, so each will have a large number of rows.

I guess Chris was right: I have to better understand the usage pattern
and do some testing of my own.
I was just hoping my hunch about Amazon being the better alternative
would be confirmed, but this does not seem to be the case; most of you
recommend purchasing a box.

I want to thank everyone for the input, really appreciate it!

regards,
mihai


Re: large database

From
David Boreham
Date:
On 12/11/2012 8:28 AM, Mihai Popa wrote:
> I guess Chris was right, I have to better understand the usage pattern
> and do some testing of my own.
> I was just hoping my hunch about Amazon being the better alternative
> would be confirmed, but this does not
> seem to be the case; most of you recommend purchasing a box.
>
Amazon (or another PG cloud provider such as Heroku) is a great choice
if you want to do some sizing tests.
However, beware that if you need big instances running for more than a
week or two, you may spend as much in fees as it would have cost to buy
a big machine outright.

Cloud services are highly economical where you need resources that are
significantly _less_ than a present-day physical machine provides. For
example I have a VM with 500MB at Rackspace that I can use to run Nagios
to check on my physical servers, located in a different state on
different peering. I think that costs something like $15/mo. I couldn't
locate any kind of physical box in a new data center for anything like
that little. But where we need all the resources provided by the biggest
box you can buy today (and several of those), it is an order of
magnitude cheaper to buy them and pay for colo space. Similarly, if I
needed machines for only a short time (to test something, for example),
cloud hosting is a nice option vs having a big pile of metal in the
corner of the office that you don't know what to do with...

Finally, note that there is a middle-ground available between cloud
hosting and outright machine purchase -- providers such as Linode and
SoftLayer will sell physical machines in a way that gives much of the
convenience of cloud hosting, but with the resource dedication and
consistency of performance of physical machines. Still not as cheap as
buying your own machines of course, if you need them long-term.








Re: large database

From
Jeff Janes
Date:
On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote:
> Hi,
>
> I've recently inherited a project that involves importing a large set of
> Access mdb files into a Postgres or MySQL database.
> The process is to export the mdb's to comma separated files than import
> those into the final database.
> We are now at the point where the csv files are all created and amount
> to some 300 GB of data.

Compressed or uncompressed?

> I would like to get some advice on the best deployment option.
>
> First, the project has been started using MySQL. Is it worth switching
> to Postgres and if so, which version should I use?

Why did you originally choose MySQL?  What has changed that causes you
to rethink that decision?  Does your team have experience with MySQL
but not with PostgreSQL?

I like PostgreSQL, of course, but if I already had an
already-functioning app on MySQL I'd be reluctant to change it.

If I were going to do so, though, I'd use 9.2.  No reason to develop
against something other than the latest stable version.


> Second, where should I deploy it? The cloud or a dedicated box?
>
> Amazon seems like the sensible choice; you can scale it up and down as
> needed and backup is handled automatically.
> I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of
> storage. Would this do, or will I end up with a larger/ more expensive
> instance?

My understanding is that RDS does not support Postgres, so if you go
that route the decision is already made for you.  Or am I wrong here?

1TB of storage sounds desperately small for loading 300GB of csv files.

IOPS would mostly depend on how you are using the system, not how large it is.

> Alternatively I looked at a Dell server with 32 GB of RAM and some
> really good hard drives. But such a box does not come cheap and I don't
> want to keep the pieces if it doesn't cut it

xlarge RDS with 1TB of storage and 10000 iops doesn't come cheap, either.

Cheers,

Jeff


Re: large database

From
Ben Chobot
Date:
On Dec 11, 2012, at 2:25 AM, Chris Angelico wrote:

> On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote:
>> Second, where should I deploy it? The cloud or a dedicated box?
>
> Forget cloud. For similar money, you can get dedicated hosting with
> much more reliable performance. We've been looking at places to deploy
> a new service, and to that end, we booked a few cloud instances and
> started playing. Bang for buck, even the lower-end dedicated servers
> (eg about $35/month) majorly outdo Amazon cloud instances.

So, speaking as somebody that's spent the last several months dealing with postgres scaling in the Amazon cloud, I don't think it's so clear cut.

Yes, AWS instances are way more expensive compared to what you can buy in physical hardware. (Paying for a reserved instance drops that cost a lot, but even then you're still paying a hefty premium.) And yes, the biggest AWS instance you can get today is mediocre at best compared to real DB hardware. Yes, smaller instance sizes are crammed together and can suffer from noisy neighbors. (This isn't a problem with larger instances - which of course cost more.) And while you still have to worry about hardware failures, natural disasters, and other forms of calamity... at least with AWS, solving those problems is as simple as throwing more money at them. When you're relatively small, getting your servers colocated in nearby-but-separate datacenters is a hassle, and having enough hardware on hand to manage failures is annoying. Dealing with networking between your data centers is annoying.

So does AWS cost a lot more and give you a lower ceiling? No doubt. But it arguably also gives you a much more stable floor to stand on, and it has a lot of other benefits that don't apply to dbs.

Re: large database

From
Mihai Popa
Date:
On Tue, 2012-12-11 at 09:47 -0700, David Boreham wrote:

> Finally, note that there is a middle-ground available between cloud
> hosting and outright machine purchase -- providers such as Linode and
> SoftLayer will sell physical machines in a way that gives much of the
> convenience of cloud hosting,

I actually looked at Linode, but Amazon looked more competitive...



--
Mihai Popa <mihai@lattica.com>
Lattica, Inc.



Re: large database

From
David Boreham
Date:
On 12/11/2012 2:03 PM, Mihai Popa wrote:
> I actually looked at Linode, but Amazon looked more competitive...
Checking Linode's web site just now, it looks like they have removed
physical machines as an option.
Try SoftLayer instead for physical machines delivered on demand:
http://www.softlayer.com/dedicated-servers/

If you're looking for low cost virtual hosting alternative to Amazon,
try Rackspace.

Different providers offer different features too. For example Rackspace
allows you to add SSD persistent storage to any node (at a price)
whereas Amazon currently doesn't offer that capability.





Re: large database

From
Mihai Popa
Date:
On Tue, 2012-12-11 at 10:00 -0800, Jeff Janes wrote:
> On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote:
> > Hi,
> >
> > I've recently inherited a project that involves importing a large set of
> > Access mdb files into a Postgres or MySQL database.
> > The process is to export the mdb's to comma separated files than import
> > those into the final database.
> > We are now at the point where the csv files are all created and amount
> > to some 300 GB of data.
>
> Compressed or uncompressed?

uncompressed, but that's not much relief...
and it's 800GB not 300 anymore. I still can't believe the size of this
thing.


> Why did you originally choose MySQL?  What has changed that causes you
> to rethink that decision?  Does your team have experience with MySQL
> but not with PostgreSQL?

I did not choose it; somebody before me did. I personally have more
experience with Postgres, but not with databases as large as this one
promises to be.

>
> I like PostgreSQL, of course, but if I already had an
> already-functioning app on MySQL I'd be reluctant to change it.

...and I'm not rushing to do it; I was just asking around, maybe there
are known issues with MySQL, or with Postgres for that matter.


> My understanding is that RDS does not support Postgres, so if you go
> that route the decision is already made for you.  Or am I wrong here?

That's right, but I could still get an EC2 instance and run my own
Postgres
Or use this: http://www.enterprisedb.com/cloud-database/pricing-amazon



> 1TB of storage sounds desperately small for loading 300GB of csv files.

really? that's good to know; I wouldn't have guessed

> IOPS would mostly depend on how you are using the system, not how large it is.

mostly true


--
Mihai Popa <mihai@lattica.com>
Lattica, Inc.



Re: large database

From
Mihai Popa
Date:
On Tue, 2012-12-11 at 14:28 -0700, David Boreham wrote:
> Try SoftLayer instead for physical machines delivered on-demand :
> http://www.softlayer.com/dedicated-servers/
>
> If you're looking for low cost virtual hosting alternative to Amazon,
> try Rackspace.

Thank you, I will


regards,

--
Mihai Popa <mihai@lattica.com>
Lattica, Inc.



Re: large database

From
John R Pierce
Date:
On 12/11/2012 1:58 PM, Mihai Popa wrote:
>> 1TB of storage sounds desperately small for loading 300GB of csv files.
> really? that's good to know; I wouldn't have guessed
>

on many of our databases, the indexes are as large as the tables.
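
A quick way to check this on an existing database (a sketch; run it in the
database in question):

-- Compare heap size vs. index size for the ten largest user tables.
SELECT c.relname,
       pg_size_pretty(pg_table_size(c.oid))   AS table_size,
       pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_table_size(c.oid) DESC
LIMIT 10;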




Re: large database

From
Adrian Klaver
Date:
On 12/11/2012 01:58 PM, Mihai Popa wrote:
> On Tue, 2012-12-11 at 10:00 -0800, Jeff Janes wrote:
>> On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote:
>>> Hi,
>>>
>>> I've recently inherited a project that involves importing a large set of
>>> Access mdb files into a Postgres or MySQL database.
>>> The process is to export the mdb's to comma separated files than import
>>> those into the final database.
>>> We are now at the point where the csv files are all created and amount
>>> to some 300 GB of data.
>>
>> Compressed or uncompressed?
>
> uncompressed, but that's not much relief...
> and it's 800GB not 300 anymore. I still can't believe the size of this
> thing.

Are you sure the conversion process is working properly?


--
Adrian Klaver
adrian.klaver@gmail.com


effective_cache_size and work_ram

From
"ac@hsk.hk"
Date:
Hi,

I have a new server for PostgreSQL 9.2.2 with 8GB physical RAM. I want to tune the server with the following changes from the defaults:

max_connections = 100           # default
shared_buffers = 2048MB # change from 24MB to 2048MB as I think 24MB is not enough
maintenance_work_mem = 400MB    # 50MB/GB x 8GB
effective_cache_size = 4096MB   # Set to 50% of total RAM
work_mem = 24MB
checkpoint_segments = 10
wal_buffers = 16MB

Are these values reasonable?

Thanks
ac

Re: effective_cache_size and work_ram

From
Pavel Stehule
Date:
hello

2012/12/12 ac@hsk.hk <ac@hsk.hk>:
> Hi,
>
> I have a new server for PostgreSQL 9.2.2 with 8GB physical RAM,  I want to
> turn the server with the following changes from default:
>
> max_connections = 100           # default
> shared_buffers = 2048MB # change from 24MB to 2048MB as I think 24MB is not
> enough
> maintenance_work_mem = 400MB    # 50MB/GB x 8GB
> effective_cache_size = 4096MB   # Set to 50% of total RAM
> work_mem = 24MB
> checkpoint_segments = 10
> wal_buffers = 16MB
>
> Are these values reasonable?

Is it a dedicated server?

Then effective_cache_size can be higher, maybe 6GB.

checkpoint_segments is too low; it can be 128.
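
So roughly (a sketch - only the values discussed here, keep the rest as you
had them):

shared_buffers = 2048MB
effective_cache_size = 6144MB   # ~75% of RAM on a dedicated box
checkpoint_segments = 128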

Regards

Pavel Stehule


>
> Thanks
> ac


Re: effective_cache_size and work_ram

From
"ac@hsk.hk"
Date:
Hi, yes it is a dedicated server. And thanks!




On 12 Dec 2012, at 3:55 PM, Pavel Stehule wrote:

> hello
>
> 2012/12/12 ac@hsk.hk <ac@hsk.hk>:
>> Hi,
>>
>> I have a new server for PostgreSQL 9.2.2 with 8GB physical RAM,  I want to
>> turn the server with the following changes from default:
>>
>> max_connections = 100           # default
>> shared_buffers = 2048MB # change from 24MB to 2048MB as I think 24MB is not
>> enough
>> maintenance_work_mem = 400MB    # 50MB/GB x 8GB
>> effective_cache_size = 4096MB   # Set to 50% of total RAM
>> work_mem = 24MB
>> checkpoint_segments = 10
>> wal_buffers = 16MB
>>
>> Are these values reasonable?
>
> it is dedicated server?
>
> then  effective_cache_size can be higher maybe 6GB
>
> checkpoint_segments are too low, can be 128
>
> Regards
>
> Pavel Stehule
>
>
>>
>> Thanks
>> ac



Re: large database

From
Mihai Popa
Date:
> Another question is whether there's a particular reason that you're
> converting to CSV prior to importing the data?
>
> All major ETL tools that I know of, including the major open source
> ones (Pentaho / Talend) can move data directly from Access databases
> to PostgreSQL.

Yes, I wish somebody had asked this question before we started the process.
I found out about it only a few days ago...




PostgreSQL contrib 9.2.x

From
"ac@hsk.hk"
Date:
Hi,

I am using Ubuntu 12.04 and used an installer to install PostgreSQL 9.2.1; the contrib modules are not installed together with the core.

I tried to run: 

sudo apt-get install postgresql-contrib
The following packages were automatically installed and are no longer required:
  libnet-daemon-perl linux-headers-3.2.0-33 libdbi-perl libdbd-mysql-perl
  libplrpc-perl linux-headers-3.2.0-33-generic
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  libossp-uuid16 libpq5 postgresql-9.1 postgresql-client-9.1
  postgresql-client-common postgresql-common postgresql-contrib-9.1
Suggested packages:
  uuid oidentd ident-server locales-all postgresql-doc-9.1 libdbd-pg-perl
The following NEW packages will be installed:
  libossp-uuid16 libpq5 postgresql-9.1 postgresql-client-9.1
  postgresql-client-common postgresql-common postgresql-contrib
  postgresql-contrib-9.1
0 upgraded, 8 newly installed, 0 to remove and 4 not upgraded.
Need to get 5,995 kB of archives.
After this operation, 17.6 MB of additional disk space will be used.

I could see that it would install the older PostgreSQL 9.1 and postgresql-contrib-9.1. As I already have 9.2.1 and do not want to have the older version 9.1 in parallel, I aborted the apt install.

How can I get pure postgresql-contrib for Postgresql 9.2.x? 

Thanks
ac
 



Re: PostgreSQL contrib 9.2.x

From
Ondrej Ivanič
Date:
Hi,

On 14 December 2012 17:56, ac@hsk.hk <ac@hsk.hk> wrote:
> I could see that it would install older PostgreSQL 9.1 and
> postgresql-contrib-9.1.  As I already have 9.2.1 and do not want to have
> older version 9.1 in parallel, I aborted the apt install.
>
> How can I get pure postgresql-contrib for Postgresql 9.2.x?

You need the PostgreSQL PPA:

sudo apt-get update
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get install postgresql-contrib-9.2

--
Ondrej Ivanic
(http://www.linkedin.com/in/ondrejivanic)


Re: PostgreSQL contrib 9.2.x

From
"ac@hsk.hk"
Date:
Hi, got it installed, thanks


On 14 Dec 2012, at 7:36 PM, Ondrej Ivanič wrote:

> Hi,
>
> On 14 December 2012 17:56, ac@hsk.hk <ac@hsk.hk> wrote:
>> I could see that it would install older PostgreSQL 9.1 and
>> postgresql-contrib-9.1.  As I already have 9.2.1 and do not want to have
>> older version 9.1 in parallel, I aborted the apt install.
>>
>> How can I get pure postgresql-contrib for Postgresql 9.2.x?
>
> You need PostreSQL PPA:
>
> sudo apt-get update
> sudo apt-get install python-software-properties
> sudo add-apt-repository ppa:pitti/postgresql
> sudo apt-get install postgresql-contrib-9.2
>
> --
> Ondrej Ivanic
> (http://www.linkedin.com/in/ondrejivanic)



Re: large database

From
Nathan Clayton
Date:


On Dec 11, 2012 2:25 PM, "Adrian Klaver" <adrian.klaver@gmail.com> wrote:
>
> On 12/11/2012 01:58 PM, Mihai Popa wrote:
>>
>> On Tue, 2012-12-11 at 10:00 -0800, Jeff Janes wrote:
>>>
>>> On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've recently inherited a project that involves importing a large set of
>>>> Access mdb files into a Postgres or MySQL database.
>>>> The process is to export the mdb's to comma separated files than import
>>>> those into the final database.
>>>> We are now at the point where the csv files are all created and amount
>>>> to some 300 GB of data.
>>>
>>>
>>> Compressed or uncompressed?
>>
>>
>> uncompressed, but that's not much relief...
>> and it's 800GB not 300 anymore. I still can't believe the size of this
>> thing.
>
>
> Are you sure the conversion process is working properly?
>
Another question is whether there's a particular reason that you're converting to CSV prior to importing the data?

All major ETL tools that I know of, including the major open source ones (Pentaho / Talend), can move data directly from Access databases to PostgreSQL. In addition, provided the table names are all the same in the Access files, you can iterate over all of the Access files in a directory at once.