Thread: Load experimentation
Hello All,

I'm in the process of loading a massive amount of data (500 GB). After some initial timings, I'm looking at 260 hours to load the entire 500GB. 10 days seems like an awfully long time so I'm searching for ways to speed this up. The load is happening in the Amazon cloud (EC2), on a m1.large instance:
-7.5 GB memory
-4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
-64-bit platform

So far I have modified my postgresql.conf file (PostgreSQL 8.1.3). The modifications I have made are as follows:

shared_buffers = 786432
work_mem = 10240
maintenance_work_mem = 6291456
max_fsm_pages = 3000000
wal_buffers = 2048
checkpoint_segments = 200
checkpoint_timeout = 300
checkpoint_warning = 30
autovacuum = off

There are a variety of instance types available in the Amazon cloud (http://aws.amazon.com/ec2/instance-types/), including high memory and high CPU. High memory instance types come with 34GB or 68GB of memory. High CPU instance types have a lot less memory (7GB max) but up to 8 virtual cores. I am more than willing to change to any of the other instance types.

Also, there is nothing else happening on the loading server. It is completely dedicated to the load.

Any advice would be greatly appreciated.

Thanks,

Ben
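(For reference: on 8.1 these settings are raw numbers, with shared_buffers and wal_buffers counted in 8kB pages, work_mem and maintenance_work_mem in kB, and the checkpoint timings in seconds. Assuming the default 8kB block size, they work out to roughly the values below, written with the explicit units newer releases accept. Illustrative only, not a recommendation.)

# Rough equivalents of the settings above, assuming 8kB blocks.
shared_buffers = 6GB            # 786432 x 8kB pages
work_mem = 10MB                 # 10240 kB
maintenance_work_mem = 6GB      # 6291456 kB
wal_buffers = 16MB              # 2048 x 8kB pages
checkpoint_segments = 200
checkpoint_timeout = 5min       # 300 seconds
checkpoint_warning = 30s
autovacuum = off
# max_fsm_pages no longer exists on 8.4 and later; the free space map is managed automatically.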
Ben Brehmer <benbrehmer@gmail.com> wrote:

> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
> each)
> -64-bit platform

What OS?

> (PostgreSQL 8.1.3)

Why use such an antiquated, buggy version? Newer versions are faster.

-Kevin
On Mon, Dec 7, 2009 at 1:12 PM, Ben Brehmer <benbrehmer@gmail.com> wrote:
> So far I have modified my postgresql.conf file (PostgreSQL 8.1.3). The modifications I have made are as follows:

Can you go with PG 8.4? That's a start :-)

I'd set fsync=off for the load, I'd also make sure that you're using the COPY command (on the server side) to do the load.
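(For illustration, a minimal sketch of a server-side COPY, using a hypothetical table and file path. fsync can be flipped in postgresql.conf for the duration of the load, but note that a crash mid-load can then corrupt the cluster, so this only makes sense for a throwaway load that can be redone.)

-- Hypothetical table and file path; COPY FROM 'file' runs on the server,
-- so the file must be readable by the postgres server process.
COPY page_views (user_id, url, viewed_at) FROM '/data/page_views.csv' WITH CSV;
-- In postgresql.conf, for the duration of the load only: fsync = off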
2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov>:
> Why use such an antiquated, buggy version? Newer versions are faster.
I'd agree with trying to use the latest version you can.
How are you loading this data? I'd make sure you haven't got any indices, primary keys, triggers or constraints on your tables before you begin the initial load, just add them after. Also use either the COPY command for loading, or prepared transactions. Individual insert commands will just take way too long.
Regards
Thom
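(A minimal sketch of the load order described above, using a hypothetical table: create the bare table, load the data, and only then add indexes and constraints.)

CREATE TABLE page_views (
    user_id   integer,
    url       text,
    viewed_at timestamp
);

-- ... bulk load here, via COPY or batched INSERTs ...

CREATE INDEX page_views_user_idx ON page_views (user_id);
ALTER TABLE page_views ADD PRIMARY KEY (user_id, viewed_at);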
Kevin,

This is running on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

Ben

On 07/12/2009 10:33 AM, Kevin Grittner wrote:
> What OS?
Ben Brehmer wrote:
> Hello All,
>
> I'm in the process of loading a massive amount of data (500 GB). After
> some initial timings, I'm looking at 260 hours to load the entire 500GB.

You don't say how you are loading the data, so there's not much to go on. But generally, there are two primary ways to speed things up:

1. Group MANY inserts into a single transaction. If you're doing a row-at-a-time, it will be very slow. The "sweet spot" seems to be somewhere between 100 and 1000 inserts in a single transaction. Below 100, you're still slowing things down, above 1000, it probably won't make much difference.

2. Use the COPY command. This requires you to format your data into the form that COPY uses. But it's VERY fast.

Craig
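(A minimal sketch of that batching, with a hypothetical table: one explicit transaction around a few hundred single-row INSERTs instead of one implicit transaction per row.)

BEGIN;
INSERT INTO page_views (user_id, url, viewed_at) VALUES (1, '/a', '2009-12-07 10:00:00');
INSERT INTO page_views (user_id, url, viewed_at) VALUES (2, '/b', '2009-12-07 10:00:01');
-- ... a few hundred more single-row INSERTs ...
COMMIT;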
Thanks for the quick responses. I will respond to all questions in one email:
By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql". The sql_file.sql contains table creates and insert statements. There are no indexes present nor created during the load.
OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
PostgreSQL: I will try upgrading to latest version.
COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data was generated (Hadoop/MapReduce).
Transactions: Have started a second load process with chunks of 1000 inserts wrapped in a transaction. It's dropped the load time for 1000 inserts from 1 hour to 7 minutes :)
Disk Setup: Using a single disk Amazon image for the destination (database). Source is coming from an EBS volume. I didn't think there were any disk options in Amazon?
Thanks!
Ben
Ben Brehmer wrote:
> Thanks for the quick responses. I will respond to all questions in one
> email:
>
> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f
> sql_file.sql". The sql_file.sql contains table creates and insert
> statements. There are no indexes present nor created during the load.

Although transactions of over 1000 INSERT statements don't speed things up much, they don't hurt either, especially on a new system that nobody is using yet. Since you're loading from big SQL files using psql, just put a "begin;" at the top of the file and a "commit;" at the bottom. Unlike Oracle, Postgres even allows CREATE and such to be done inside a transaction.

And BTW, don't forget to ANALYZE when you're all done.

Craig
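(A sketch of what one of the generated files could look like with that change, table and rows hypothetical: the CREATE and all of the INSERTs inside a single transaction, with an ANALYZE once the load is finished.)

-- sql_file.sql
BEGIN;
CREATE TABLE page_views (user_id integer, url text, viewed_at timestamp);
INSERT INTO page_views VALUES (1, '/a', '2009-12-07 10:00:00');
-- ... the rest of the file's INSERTs ...
COMMIT;
ANALYZE page_views;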
On Monday 07 December 2009, Ben Brehmer <benbrehmer@gmail.com> wrote:
> Disk Setup: Using a single disk Amazon image for the destination
> (database). Source is coming from an EBS volume. I didn't think there
> were any disk options in Amazon?

I don't think any Amazon cloud service is particularly well suited to a database. Combine virtualized hosts with terrible I/O, and it's actually hard to envision a worse place to run a database.

--
"No animals were harmed in the recording of this episode. We tried but that damn monkey was just too fast."
Ben Brehmer wrote:
> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql". The sql_file.sql contains table creates and insert statements. There are no indexes present nor created during the load.

Your basic options here are to batch the INSERTs into bigger chunks, and/or to split your data file up so that it can be loaded by more than one process at a time. There's some comments and links to more guidance here at http://wiki.postgresql.org/wiki/Bulk_Loading_and_Restores

> COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data was generated (Hadoop/MapReduce).

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
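(A sketch of one way to run several loader processes at once, assuming GNU xargs and a directory of pre-split files; the directory, database name, and parallelism are placeholders.)

# Load the pre-split chunks four at a time; each psql gets one file after -f.
ls sql_chunks/*.sql | xargs -P 4 -n 1 psql -U postgres -d somedatabase -f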
Thanks for all the responses. I have one more thought:
Since my input data is split into about 200 files (3GB each), I could potentially spawn one load command for each file. What would be the maximum number of input connections Postgres can handle without bogging down? When I say 'input connection' I mean "psql -U postgres -d dbname -f one_of_many_sql_files".
Thanks,
Ben
Ben Brehmer wrote:
> Since my input data is split into about 200 files (3GB each), I could potentially spawn one load command for each file. What would be the maximum number of input connections Postgres can handle without bogging down?

You can expect to easily get one loader process per real CPU going. Beyond that, it depends on how CPU intensive they all are and what the resulting I/O rate out of the combination is. You're probably going to run out of CPU on a loading job long before you hit any of the other limits in this area, and potentially you could run out of disk throughput on a cloud system before that. PostgreSQL isn't going to bog down on a connection basis until you've reached several hundred of them, your loader will be lucky to hit 10 active processes before it grinds to a halt on some physical resources unrelated to general database scaling.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Tue, Dec 8, 2009 at 12:22 AM, Ben Brehmer <benbrehmer@gmail.com> wrote:
> Since my input data is split into about 200 files (3GB each), I could
> potentially spawn one load command for each file. What would be the maximum
> number of input connections Postgres can handle without bogging down?

This is VERY dependent on your IO capacity and number of cores. My experience is that unless you're running on a decent number of disks, you'll run out of IO capacity first in most machines. n pairs of mirrors in a RAID-10 can handle x input threads where x has some near linear relation to n. Have 100 disks in a RAID-10 array? You can surely handle dozens of load threads with no IO wait. Have 4 disks in a RAID-10? Maybe two to four load threads will max you out. Once you're IO bound, adding more threads and more CPUs won't help, it'll hurt. The only way to really know is to benchmark it, but I'd guess that about half as many import threads as mirror pairs in a RAID-10 (or just drives if you're using RAID-0) would be a good place to start and work from there.
On Tue, Dec 8, 2009 at 12:58 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Once you're IO bound, adding more threads and more CPUs won't help, it'll
> hurt.

Note that if you start running out of CPU horsepower first the degradation will be less harsh as you go past the knee in the performance curve. IO has a sharper knee than CPU.
Hi,

Ben Brehmer <benbrehmer@gmail.com> writes:
> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f
> sql_file.sql". The sql_file.sql contains table creates and insert
> statements. There are no indexes present nor created during the load.
>
> COPY command: Unfortunately I'm stuck with INSERTS due to the nature
> this data was generated (Hadoop/MapReduce).

What I think you could do is the followings:

 - switch to using 8.4
 - load your files in a *local* database
 - pg_dump -Fc
 - now pg_restore -j X on the amazon setup

That way you will be using COPY rather than INSERTs and parallel loading built-in pg_restore (and optimisations of when to add the indexes and constraints). The X is to choose depending on the IO power and the number of CPU...

Regards,
--
dim
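(A sketch of those steps as commands; the host and database names are placeholders, and the target database is assumed to already exist on the EC2 side. Parallel restore with -j needs pg_restore 8.4 and a custom-format dump.)

# On the local box, after the data has been loaded into a local 8.4 database:
pg_dump -Fc -U postgres somedatabase > somedatabase.dump

# Against the EC2 instance, restore with several parallel jobs:
pg_restore -j 4 -U postgres -h my-ec2-host -d somedatabase somedatabase.dump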
On Tue, Dec 8, 2009 at 2:08 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> What I think you could do is the followings:
>
> - switch to using 8.4
> - load your files in a *local* database
> - pg_dump -Fc
> - now pg_restore -j X on the amazon setup

That's a lot of work to get to COPY. It might be enough to drop all FK relations and indexes on the destination db in the cloud, load the data in a few (or one) transaction(s), then recreate indexes and FK relationships.
Scott Marlowe <scott.marlowe@gmail.com> writes:
> That's a lot of work to get to COPY.

Well, yes. I thought about it this way only after having read that the OP is uneasy with producing another format from his source data, and considering it's a one-shot operation.

Ah, tradeoffs, how to find the right one!
--
dim
On 12/07/2009 12:12 PM, Ben Brehmer wrote:
> I'm in the process of loading a massive amount of data (500 GB). After
> some initial timings, I'm looking at 260 hours to load the entire 500GB.

I'm kind of curious, how goes the load? Is it done yet? Still looking at days'n'days to finish?

I was thinking... If the .sql files are really nicely formatted, it would not be too hard to whip up a perl script to run as a filter to change the statements into copy's.

Each file would have to only fill one table, and only contain inserts, and all the insert statements would have to set the same fields. (And I'm sure there could be other problems).

Also, just for the load, did you disable fsync?

-Andy
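(For illustration only, with a hypothetical table and rows: the kind of rewrite such a filter would perform, turning a run of single-row INSERTs into a COPY ... FROM stdin block that psql can feed directly.)

-- What the generated sql_file.sql contains:
INSERT INTO page_views (user_id, url, viewed_at) VALUES (1, '/a', '2009-12-07 10:00:00');
INSERT INTO page_views (user_id, url, viewed_at) VALUES (2, '/b', '2009-12-07 10:00:01');

-- What the filter would emit instead (data rows are tab-separated, terminated by \.):
COPY page_views (user_id, url, viewed_at) FROM stdin;
1	/a	2009-12-07 10:00:00
2	/b	2009-12-07 10:00:01
\.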
Hi Andy,

Load is chugging along. We've optimized our postgres conf as much as possible but are seeing the inevitable I/O bottleneck. I had the same thought as you (converting inserts into copy's) a while back but unfortunately each file has many inserts into many different tables. Potentially I could rip through this with a little MapReduce job on 50-100 nodes, which is still something I might do.

One thought we are playing with was taking advantage of 4 x 414GB EBS devices in a RAID0 configuration. This would spread disk writes across 4 block devices.

Right now I'm wrapping about 1500 inserts in a transaction block. Since it's an I/O bottleneck, COPY statements might not give me much advantage.

It's definitely a work in progress :)

Ben
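(A sketch of the RAID0 idea using Linux md, purely illustrative: the device names depend on how the four EBS volumes are attached, and the filesystem and mount point are placeholders.)

mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
mkfs.xfs /dev/md0
mount /dev/md0 /pgdata    # then point the data directory (or a tablespace) at /pgdata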
On 12/7/09 11:12 AM, "Ben Brehmer" <benbrehmer@gmail.com> wrote:
> COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data
> was generated (Hadoop/MapReduce).

If you have control over the MapReduce output, you can have that output result files in a format that COPY likes.

If you don't have any control over that it's more complicated. I use a final pass Hadoop Map only job to go over the output and insert into postgres directly from the job, using the:

INSERT INTO <table> VALUES (val1, val2, ... ), (val1, val2, ...), ...

insert style from Java with about 80 rows per insert statement and a single transaction for about a thousand of these. This was faster than batch inserts.
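(A sketch of that insert style in plain SQL, with a hypothetical table and values; multi-row VALUES lists need PostgreSQL 8.2 or later.)

BEGIN;
INSERT INTO page_views (user_id, url, viewed_at) VALUES
    (1, '/a', '2009-12-07 10:00:00'),
    (2, '/b', '2009-12-07 10:00:01'),
    (3, '/c', '2009-12-07 10:00:02');
-- ... roughly 80 rows per statement, and on the order of a thousand such
-- statements before the COMMIT ...
COMMIT;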
On 12/10/09 3:29 PM, "Scott Carey" <scott@richrelevance.com> wrote:
> INSERT INTO <table> VALUES (val1, val2, ... ), (val1, val2, ...), ...
> Insert style from Java with about 80 rows per insert statement and a single
> transaction for about a thousand of these. This was faster than batch
> inserts.

I should mention that the above is a bit off. There is an important caveat that each of these individual tasks might run twice in Hadoop (only one will finish -- speculative execution and retry on error). To deal with this you can run each job inside a single transaction so that a failure will rollback, and likely want to turn off speculative execution.

Another option is to run only one map job, with no reduce for this sort of work in order to ensure duplicate data is not inserted.

We are inserting into a temp table named uniquely per chunk first (sometimes in parallel). Then while holding a postgres advisory lock we do a SELECT * FROM <temp> INTO <destination> type operation, which is fast.
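(A sketch of one way that staging pattern could look, with hypothetical names and an INSERT ... SELECT standing in for the "SELECT * FROM <temp> INTO <destination>" step; the advisory lock serializes the final append into the destination table.)

-- One staging table per Hadoop chunk, loaded in parallel with the others.
CREATE TABLE page_views_chunk_0001 (LIKE page_views);
-- ... bulk insert this chunk into page_views_chunk_0001 ...

-- Hold an advisory lock while appending into the destination table.
SELECT pg_advisory_lock(1);
INSERT INTO page_views SELECT * FROM page_views_chunk_0001;
SELECT pg_advisory_unlock(1);
DROP TABLE page_views_chunk_0001;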