Thread: large database
Hi, I've recently inherited a project that involves importing a large set of Access mdb files into a Postgres or MySQL database. The process is to export the mdb's to comma separated files, then import those into the final database. We are now at the point where the csv files are all created and amount to some 300 GB of data. I would like to get some advice on the best deployment option. First, the project has been started using MySQL. Is it worth switching to Postgres and if so, which version should I use? Second, where should I deploy it? The cloud or a dedicated box? Amazon seems like the sensible choice; you can scale it up and down as needed and backup is handled automatically. I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of storage. Would this do, or will I end up with a larger/more expensive instance? Alternatively I looked at a Dell server with 32 GB of RAM and some really good hard drives. But such a box does not come cheap and I don't want to keep the pieces if it doesn't cut it. Thank you, -- Mihai Popa <mihai@lattica.com> Lattica, Inc.
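[For the load step itself, a minimal sketch of the PostgreSQL side; the table and file names are hypothetical and the CSV is assumed to have a header row matching the table's columns. COPY reads files as the server user, so use psql's \copy instead if the file only exists on the client.]

    -- Server-side bulk load of one exported CSV (9.0+ option syntax):
    COPY sales FROM '/data/csv/sales.csv' WITH (FORMAT csv, HEADER true);

    -- Client-side equivalent, streaming the file through psql:
    \copy sales from '/data/csv/sales.csv' csv header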
Hi, On 11 December 2012 07:26, Mihai Popa <mihai@lattica.com> wrote: > First, the project has been started using MySQL. Is it worth switching > to Postgres and if so, which version should I use?

You should consider several things:
- do you have in-depth MySQL knowledge in your team?
- do you need any sql_mode "features"? (http://dev.mysql.com/doc/refman/5.6/en/server-sql-mode.html)
- do you need flexible, easy to set up and monitor replication?
- do you need multi-master?
- do you need REPLACE / INSERT ... ON DUPLICATE KEY UPDATE / INSERT IGNORE syntax? (a PostgreSQL workaround is sketched below)
- do you need many connections to your database without deploying/using a load balancer?
- do you need something which is MySQL only? (http://dev.mysql.com/doc/refman/5.0/en/compatibility.html)

If you have 4 or more 'yes' answers then I would stick with MySQL... -- Ondrej Ivanic (http://www.linkedin.com/in/ondrejivanic)
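[On the REPLACE / INSERT ... ON DUPLICATE KEY UPDATE point: PostgreSQL of this vintage (9.2) has no native upsert, so a workaround is needed. A sketch using a writable CTE (9.1+); the table and column names are hypothetical, and it is not fully safe under concurrency without retry logic:]

    -- MySQL-only form:
    --   INSERT INTO counters (id, hits) VALUES (1, 1)
    --   ON DUPLICATE KEY UPDATE hits = hits + 1;
    -- PostgreSQL workaround:
    WITH updated AS (
        UPDATE counters SET hits = hits + 1 WHERE id = 1 RETURNING id
    )
    INSERT INTO counters (id, hits)
    SELECT 1, 1
    WHERE NOT EXISTS (SELECT 1 FROM updated);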
Hi Mihai. > We are now at the point where the csv files are all created and amount > to some 300 GB of data. > I would like to get some advice on the best deployment option. First - and maybe best - advice: do some testing on your own and plan some time for this. > First, the project has been started using MySQL. Is it worth switching > to Postgres and if so, which version should I use? When switching to PostgreSQL I would recommend using the latest stable version. But your project is already running on MySQL - are there issues you expect to solve by switching to another database system? If not: why switch? > Second, where should I deploy it? The cloud or a dedicated box? Given 1TB of storage, the x-large instance and 10000 provisioned IOPS would mean about 2000 USD for a 100% utilized instance on Amazon. This is not really ultra-cheap ;-) For two months' running cost you can get a dedicated server with eight drives, buy two extra SSDs and have full control of a Dell server. But things get much cheaper if real IOPS are not needed at such a high rate. Also, when using a cloud infrastructure and needing your data on a local system, keep network latency in mind. We have several huge PostgreSQL databases running, have used OpenIndiana with ZFS and SSDs for data storage for quite a while now, and it works perfectly. There are some slides from Sun/Oracle about ZFS, ZIL, SSD and PostgreSQL performance (I can look for them if needed). > Alternatively I looked at a Dell server with 32 GB of RAM and some > really good hard drives. But such a box does not come cheap and I don't > want to keep the pieces if it doesn't cut it Just a hint: do not simply look at Dell's list prices - phone them and get a quote. I was surprised (but do not buy SSDs there). Think about how your data is structured and how it will be queried after it is imported into the database, to see where your bottlenecks are. Cheers, Jan
Hello Jan, hello list, On 12/11/2012 09:10 AM, Jan Kesten wrote: > There are some slides from Sun/Oracle about ZFS, ZIL, SSD and > PostgreSQL performance (I can look for them if needed). I would very much appreciate a copy or a link to these slides! Johannes
Hi all, > I would very much appreciate a copy or a link to these slides! here they are: http://www.scribd.com/mobile/doc/61186429 Have fun!
On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote: > Second, where should I deploy it? The cloud or a dedicated box? Forget cloud. For similar money, you can get dedicated hosting with much more reliable performance. We've been looking at places to deploy a new service, and to that end, we booked a few cloud instances and started playing. Bang for buck, even the lower-end dedicated servers (eg about $35/month) majorly outdo Amazon cloud instances. But don't take someone's word for it. Amazon let you trial their system for a year, up to (I think) ~750 computation hours per month, of their basic instance type. You can find out for yourself exactly how unsuitable it is! :) The fact is that cloud platforms offer flexibility, and that flexibility comes at a significant cost. I don't think PostgreSQL can adequately exploit X nodes with 600MB RAM each, while it _can_ make excellent use of a single computer with gobs (that's a new SI unit, you know) of memory. Incidentally, I've heard tell that cloud instances can vary enormously in performance through the day or week, but we did some cursory testing and didn't experience that. That doesn't prove you won't have problems, of course, but it's one of the purported downsides of clouding that clearly isn't as universal as I've heard said. ChrisA
On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote: >> Second, where should I deploy it? The cloud or a dedicated box? > [ChrisA's "Forget cloud..." reply, quoted in full above, snipped] Would you say the issue is cloudy?
(I'm not being entirely facetious!)
Cheers,
Gavin
Hi all, On 12/11/2012 11:02 AM, Jan Kesten wrote: >> I would very much appreciate a copy or a link to these slides! > here they are: > > http://www.scribd.com/mobile/doc/61186429 > thank you very much! Johannes
On Tue, Dec 11, 2012 at 9:33 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > > On Tue, Dec 11, 2012 at 7:26 AM, Mihai Popa <mihai@lattica.com> wrote: > > Second, where should I deploy it? The cloud or a dedicated box? > > Would you say the issue is cloudy? > (I'm not being entirely facetious!) *Groan* :) It's certainly not clear-cut in the general case. In our specific case, cloud was definitely not the way to go (we have other systems in place for handling scale-out, so it's better for us to simply get X dedicated computers and have an administrative decision to scale up to Y, rather than automate it up and down as cloud can do). ChrisA
On Mon, 10 Dec 2012 15:26:02 -0500 (EST) "Mihai Popa" <mihai@lattica.com> wrote: > Hi, > > I've recently inherited a project that involves importing a large set of > Access mdb files into a Postgres or MySQL database. > The process is to export the mdb's to comma separated files, then import > those into the final database. > We are now at the point where the csv files are all created and amount > to some 300 GB of data. > > I would like to get some advice on the best deployment option. > > First, the project has been started using MySQL. Is it worth switching > to Postgres and if so, which version should I use? I've been managing a few large databases this year, on both PostgreSQL and MySQL. Don't put your data in MySQL. Ever. If you feel like you need to use something like MySQL, just go straight to a system that was designed with no constraints right off the bat, like Mongo or something. Don't put large amounts of data in MySQL. There are lots of issues with it. Despite the fact that lots of people have been able to make it work (me, for example) it's a LOT harder to keep running well than it is on PostgreSQL. MySQL just isn't designed to deal with large data. As some examples: lack of CREATE INDEX CONCURRENTLY, the fact that the default configuration stores everything in a single file, the fact that any table changes (including simple things like adding a comment, or seemingly unrelated things like adding an index) require a complete table rebuild, and the fact that if you use anything other than INT AUTO_INCREMENT for your primary key you're liable to hit awful inefficiencies. PostgreSQL has none of these problems. -- Bill Moran <wmoran@potentialtech.com>
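[For reference, the CREATE INDEX CONCURRENTLY mentioned above is how PostgreSQL builds an index without blocking writes to the table; the index and table names here are hypothetical:]

    -- Slower than a plain CREATE INDEX and cannot run inside a
    -- transaction block; a failed build leaves an INVALID index to
    -- drop and retry.
    CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);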
On 12/10/2012 1:26 PM, Mihai Popa wrote: > > Second, where should I deploy it? The cloud or a dedicated box? > > Amazon seems like the sensible choice; you can scale it up and down as > needed and backup is handled automatically. > I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of > storage. Would this do, or will I end up with a larger/more expensive > instance? > > Alternatively I looked at a Dell server with 32 GB of RAM and some > really good hard drives. But such a box does not come cheap and I don't > want to keep the pieces if it doesn't cut it > Note that it will be much cheaper to buy a machine as parts from the likes of Newegg or Amazon and put it together yourself, if you have the time to spare and don't care about a Dell warranty. We've been deploying 64G hex-core Xeon Ivy Bridge boxes with 300G SSD for about $2500. Another 300G SSD for your database size would add about $1200. The newer Intel DC-series SSDs should be cheaper, but are not yet available. So, as someone else pointed out, you could buy a very capable box outright for the cost of a few months of Amazon fees. I'm not sure I'd worry too much about how to do backups (Amazon just copies the data, the same as you can, to an external drive), but if you need things like spare machines, pay a human to manage them and so on, then there are cost benefits to the cloud approach. It also allows you to blame someone else if something goes wrong.
On 12/11/2012 07:27 AM, Bill Moran wrote: > On Mon, 10 Dec 2012 15:26:02 -0500 (EST) "Mihai Popa" <mihai@lattica.com> wrote: > >> Hi, >> >> I've recently inherited a project that involves importing a large set of >> Access mdb files into a Postgres or MySQL database. >> The process is to export the mdb's to comma separated files, then import >> those into the final database. >> We are now at the point where the csv files are all created and amount >> to some 300 GB of data. >> >> I would like to get some advice on the best deployment option. >> >> First, the project has been started using MySQL. Is it worth switching >> to Postgres and if so, which version should I use? > I've been managing a few large databases this year, on both PostgreSQL and > MySQL. > > Don't put your data in MySQL. Ever. If you feel like you need to use > something like MySQL, just go straight to a system that was designed with > no constraints right off the bat, like Mongo or something. I've never worked with MySQL before; I did work with Postgres a lot over the last few years, but never with such large databases, so I cannot really choose one over the other; hence my posting :) > and the fact that if you use anything other than INT AUTO_INCREMENT for > your primary key you're liable to hit awful inefficiencies. Unfortunately, I don't know much yet about the usage pattern; all I know is that the data is mostly read-only; there will be a few updates every year, but they will probably happen as batch jobs overnight. And meanwhile it appears there is a lot more of it: 800 GB rather than 300 as initially thought. There aren't a lot of tables, so each will have a large number of rows. I guess Chris was right, I have to better understand the usage pattern and do some testing of my own. I was just hoping my hunch about Amazon being the better alternative would be confirmed, but this does not seem to be the case; most of you recommend purchasing a box. I want to thank everyone for the input, really appreciate it! regards, mihai
On 12/11/2012 8:28 AM, Mihai Popa wrote: > I guess Chris was right, I have to better understand the usage pattern > and do some testing of my own. > I was just hoping my hunch about Amazon being the better alternative > would be confirmed, but this does not > seem to be the case; most of you recommend purchasing a box. > Amazon (or another PG cloud provider such as Heroku) is a great choice if you want to do some sizing tests. However, beware that if you need big instances running for more than a week or two, you may spend as much in fees as it would have cost to buy a big machine outright. Cloud services are highly economical where you need resources that are significantly _less_ than a present-day physical machine provides. For example, I have a VM with 500MB at Rackspace that I use to run Nagios to check on my physical servers, located in a different state on different peering. I think that costs something like $15/mo. I couldn't locate any kind of physical box in a new data center for anything like that little. But where we need all the resources provided by the biggest box you can buy today (and several of those), it is an order of magnitude cheaper to buy them and pay for colo space. Similarly, if I needed machines for only a short time (to test something, for example), cloud hosting is a nice option vs. having a big pile of metal in the corner of the office that you don't know what to do with... Finally, note that there is a middle ground available between cloud hosting and outright machine purchase -- providers such as Linode and SoftLayer will sell physical machines in a way that gives much of the convenience of cloud hosting, but with the resource dedication and consistency of performance of physical machines. Still not as cheap as buying your own machines, of course, if you need them long-term.
On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote: > Hi, > > I've recently inherited a project that involves importing a large set of > Access mdb files into a Postgres or MySQL database. > The process is to export the mdb's to comma separated files, then import > those into the final database. > We are now at the point where the csv files are all created and amount > to some 300 GB of data. Compressed or uncompressed? > I would like to get some advice on the best deployment option. > > First, the project has been started using MySQL. Is it worth switching > to Postgres and if so, which version should I use? Why did you originally choose MySQL? What has changed that causes you to rethink that decision? Does your team have experience with MySQL but not with PostgreSQL? I like PostgreSQL, of course, but if I already had a functioning app on MySQL I'd be reluctant to change it. If I were going to do so, though, I'd use 9.2. No reason to develop against something other than the latest stable version. > Second, where should I deploy it? The cloud or a dedicated box? > > Amazon seems like the sensible choice; you can scale it up and down as > needed and backup is handled automatically. > I was thinking of an x-large RDS instance with 10000 IOPS and 1 TB of > storage. Would this do, or will I end up with a larger/more expensive > instance? My understanding is that RDS does not support Postgres, so if you go that route the decision is already made for you. Or am I wrong here? 1TB of storage sounds desperately small for loading 300GB of csv files. IOPS would mostly depend on how you are using the system, not how large it is. > Alternatively I looked at a Dell server with 32 GB of RAM and some > really good hard drives. But such a box does not come cheap and I don't > want to keep the pieces if it doesn't cut it An xlarge RDS instance with 1TB of storage and 10000 IOPS doesn't come cheap, either. Cheers, Jeff
On Tue, 2012-12-11 at 09:47 -0700, David Boreham wrote: > Finally, note that there is a middle-ground available between cloud > hosting and outright machine purchase -- providers such as Linode and > SoftLayer will sell physical machines in a way that gives much of the > convenience of cloud hosting, I actually looked at Linode, but Amazon looked more competitive... -- Mihai Popa <mihai@lattica.com> Lattica, Inc.
On 12/11/2012 2:03 PM, Mihai Popa wrote: > I actually looked at Linode, but Amazon looked more competitive... Checking Linode's web site just now it looks like they have removed physical machines as an option. Try SoftLayer instead for physical machines delivered on-demand : http://www.softlayer.com/dedicated-servers/ If you're looking for low cost virtual hosting alternative to Amazon, try Rackspace. Different providers offer different features too. For example Rackspace allows you to add SSD persistent storage to any node (at a price) whereas Amazon currently doesn't offer that capability.
On Tue, 2012-12-11 at 10:00 -0800, Jeff Janes wrote: > On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote: > > Hi, > > > > I've recently inherited a project that involves importing a large set of > > Access mdb files into a Postgres or MySQL database. > > The process is to export the mdb's to comma separated files than import > > those into the final database. > > We are now at the point where the csv files are all created and amount > > to some 300 GB of data. > > Compressed or uncompressed? uncompressed, but that's not much relief... and it's 800GB not 300 anymore. I still can't believe the size of this thing. > Why did you originally choose MySQL? What has changed that causes you > to rethink that decision? Does your team have experience with MySQL > but not with PostgreSQL? I did not choose it; somebody before me did. I personally have more experience with Postgres, but not with databases as large as this one promises to be. > > I like PostgreSQL, of course, but if I already had an > already-functioning app on MySQL I'd be reluctant to change it. ...and I'm not rushing to do it; I was just asking around, maybe there are known issues with MySQL, or with Postgres for that matter. > My understanding is that RDS does not support Postgres, so if you go > that route the decision is already made for you. Or am I wrong here? That's right, but I could still get an EC2 instance and run my own Postgres Or use this: http://www.enterprisedb.com/cloud-database/pricing-amazon > 1TB of storage sounds desperately small for loading 300GB of csv files. really? that's good to know; I wouldn't have guessed > IOPS would mostly depend on how you are using the system, not how large it is. mostly true -- Mihai Popa <mihai@lattica.com> Lattica, Inc.
On Tue, 2012-12-11 at 14:28 -0700, David Boreham wrote: > Try SoftLayer instead for physical machines delivered on-demand : > http://www.softlayer.com/dedicated-servers/ > > If you're looking for low cost virtual hosting alternative to Amazon, > try Rackspace. Thank you, I will regards, -- Mihai Popa <mihai@lattica.com> Lattica, Inc.
On 12/11/2012 1:58 PM, Mihai Popa wrote: >> 1TB of storage sounds desperately small for loading 300GB of csv files. > really? that's good to know; I wouldn't have guessed > on many of our databases, the indexes are as large as the tables.
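[If you want to check that ratio on an existing database, a quick query (9.0+; the table name here is hypothetical):]

    SELECT pg_size_pretty(pg_table_size('orders'))   AS heap_size,
           pg_size_pretty(pg_indexes_size('orders')) AS index_size;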
On 12/11/2012 01:58 PM, Mihai Popa wrote: > On Tue, 2012-12-11 at 10:00 -0800, Jeff Janes wrote: >> On Mon, Dec 10, 2012 at 12:26 PM, Mihai Popa <mihai@lattica.com> wrote: >>> Hi, >>> >>> I've recently inherited a project that involves importing a large set of >>> Access mdb files into a Postgres or MySQL database. >>> The process is to export the mdb's to comma separated files than import >>> those into the final database. >>> We are now at the point where the csv files are all created and amount >>> to some 300 GB of data. >> >> Compressed or uncompressed? > > uncompressed, but that's not much relief... > and it's 800GB not 300 anymore. I still can't believe the size of this > thing. Are you sure the conversion process is working properly? -- Adrian Klaver adrian.klaver@gmail.com
hello 2012/12/12 ac@hsk.hk <ac@hsk.hk>: > Hi, > > I have a new server for PostgreSQL 9.2.2 with 8GB physical RAM, I want to > tune the server with the following changes from default: > > max_connections = 100 # default > shared_buffers = 2048MB # change from 24MB to 2048MB as I think 24MB is not > enough > maintenance_work_mem = 400MB # 50MB/GB x 8GB > effective_cache_size = 4096MB # Set to 50% of total RAM > work_mem = 24MB > checkpoint_segments = 10 > wal_buffers = 16MB > > Are these values reasonable? Is it a dedicated server? Then effective_cache_size can be higher, maybe 6GB. checkpoint_segments is too low; it can be 128. Regards Pavel Stehule > > Thanks > ac
Hi, yes it is a dedicated server. and THANKS! On 12 Dec 2012, at 3:55 PM, Pavel Stehule wrote: > hello > > 2012/12/12 ac@hsk.hk <ac@hsk.hk>: >> Hi, >> >> I have a new server for PostgreSQL 9.2.2 with 8GB physical RAM, I want to >> tune the server with the following changes from default: >> >> max_connections = 100 # default >> shared_buffers = 2048MB # change from 24MB to 2048MB as I think 24MB is not >> enough >> maintenance_work_mem = 400MB # 50MB/GB x 8GB >> effective_cache_size = 4096MB # Set to 50% of total RAM >> work_mem = 24MB >> checkpoint_segments = 10 >> wal_buffers = 16MB >> >> Are these values reasonable? > > Is it a dedicated server? > > Then effective_cache_size can be higher, maybe 6GB > > checkpoint_segments is too low; it can be 128 > > Regards > > Pavel Stehule > > >> >> Thanks >> ac
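[Pulling the exchange above together, the resulting postgresql.conf fragment for a dedicated 8GB box would look roughly like this; treat it as a starting sketch, not a workload-tuned recommendation:]

    shared_buffers = 2048MB        # ~25% of RAM
    work_mem = 24MB                # per sort/hash operation, per backend
    maintenance_work_mem = 400MB   # speeds up CREATE INDEX and VACUUM
    effective_cache_size = 6GB     # planner hint only; a dedicated box can go higher
    checkpoint_segments = 128      # the default of 3 is far too low for bulk loads
    wal_buffers = 16MB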
Another question is whether there's a particular reason that you're converting to CSV prior to importing the data? > > All major ETL tools that I know of, including the major open source > ones (Pentaho / Talend) can move data directly from Access databases > to PostgreSQL. > Yes, I wish somebody had asked this question before we started the process. Found out about it only a few days ago...
Hi, On 14 December 2012 17:56, ac@hsk.hk <ac@hsk.hk> wrote: > I could see that it would install older PostgreSQL 9.1 and > postgresql-contrib-9.1. As I already have 9.2.1 and do not want to have > older version 9.1 in parallel, I aborted the apt install. > > How can I get pure postgresql-contrib for PostgreSQL 9.2.x? You need the PostgreSQL PPA:

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update
sudo apt-get install postgresql-contrib-9.2

-- Ondrej Ivanic (http://www.linkedin.com/in/ondrejivanic)
Hi, got it installed, thanks On 14 Dec 2012, at 7:36 PM, Ondrej Ivanič wrote: > Hi, > > On 14 December 2012 17:56, ac@hsk.hk <ac@hsk.hk> wrote: >> I could see that it would install older PostgreSQL 9.1 and >> postgresql-contrib-9.1. As I already have 9.2.1 and do not want to have >> older version 9.1 in parallel, I aborted the apt install. >> >> How can I get pure postgresql-contrib for PostgreSQL 9.2.x? > > You need the PostgreSQL PPA: > > sudo apt-get install python-software-properties > sudo add-apt-repository ppa:pitti/postgresql > sudo apt-get update > sudo apt-get install postgresql-contrib-9.2 > > -- > Ondrej Ivanic > (http://www.linkedin.com/in/ondrejivanic)
On Dec 11, 2012 2:25 PM, "Adrian Klaver" <adrian.klaver@gmail.com> wrote:
> [earlier thread quoting snipped]
> Are you sure the conversion process is working properly?
Another question is whether there's a particular reason that you're converting to CSV prior to importing the data?
All major ETL tools that I know of, including the major open source ones (Pentaho / Talend), can move data directly from Access databases to PostgreSQL. In addition, provided the table names are all the same in the Access files, you can iterate over all of the Access files in a directory at once.
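[The same direct approach can be scripted without an ETL tool if mdb-tools is installed: pipe mdb-export straight into psql, skipping the intermediate CSV files entirely. A sketch, assuming each Access table name matches its PostgreSQL table and a reachable database called target_db; all names are hypothetical:]

    #!/bin/sh
    # Stream every table of every .mdb file into PostgreSQL.
    for f in /data/mdb/*.mdb; do
        for t in $(mdb-tables -1 "$f"); do      # -1: one table name per line
            mdb-export -H "$f" "$t" \
                | psql -d target_db -c "\copy \"$t\" from stdin csv"
        done
    done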