Thread: Load experimentation

Load experimentation

From
Ben Brehmer
Date:
Hello All,

I'm in the process of loading a massive amount of data (500 GB). After
some initial timings, I'm looking at 260 hours to load the entire 500GB.
10 days seems like an awfully long time so I'm searching for ways to
speed this up. The load is happening in the Amazon cloud (EC2), on a
m1.large instance:
-7.5 GB memory
-4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
-64-bit platform


So far I have modified my postgresql.conf  file (PostgreSQL 8.1.3). The
modifications I have made are as follows:

shared_buffers = 786432
work_mem = 10240
maintenance_work_mem = 6291456
max_fsm_pages = 3000000
wal_buffers = 2048
checkpoint_segments = 200
checkpoint_timeout = 300
checkpoint_warning = 30
autovacuum = off


There are a variety of instance types available in the Amazon cloud
(http://aws.amazon.com/ec2/instance-types/), including high memory and
high CPU. High memory instance types come with 34GB or 68GB of memory.
High CPU instance types have a lot less memory (7GB max) but up to 8
virtual cores. I am more than willing to change to any of the other
instance types.

Also, there is nothing else happening on the loading server. It is
completely dedicated to the load.

Any advice would be greatly appreciated.

Thanks,

Ben

Re: Load experimentation

From
"Kevin Grittner"
Date:
Ben Brehmer <benbrehmer@gmail.com> wrote:

> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>    each)
> -64-bit platform

What OS?

> (PostgreSQL 8.1.3)

Why use such an antiquated, buggy version?  Newer versions are
faster.

-Kevin

Re: Load experimentation

From
Scott Mead
Date:
On Mon, Dec 7, 2009 at 1:12 PM, Ben Brehmer <benbrehmer@gmail.com> wrote:
Hello All,

I'm in the process of loading a massive amount of data (500 GB). After some initial timings, I'm looking at 260 hours to load the entire 500GB. 10 days seems like an awfully long time so I'm searching for ways to speed this up. The load is happening in the Amazon cloud (EC2), on a m1.large instance:
-7.5 GB memory
-4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
-64-bit platform


So far I have modified my postgresql.conf  file (PostgreSQL 8.1.3). The modifications I have made are as follows:

    Can you go with PG 8.4?  That's a start :-)

shared_buffers = 786432
work_mem = 10240
maintenance_work_mem = 6291456
max_fsm_pages = 3000000
wal_buffers = 2048
checkpoint_segments = 200
checkpoint_timeout = 300
checkpoint_warning = 30
autovacuum = off

   I'd set fsync=off for the load. I'd also make sure that you're using the COPY command (on the server side) to do the load.
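For example, a rough sketch of both suggestions (the table and file names are just placeholders, and remember fsync=off risks data loss if the server crashes mid-load, so turn it back on afterwards):

  # in postgresql.conf, for the duration of the load only:
  #   fsync = off
  # then a server-side COPY, reading a file local to the database server:
  psql -U postgres -d somedatabase -c "COPY mytable FROM '/data/mytable.dat'"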
 

Re: Load experimentation

From
Thom Brown
Date:
2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov>
Ben Brehmer <benbrehmer@gmail.com> wrote:

> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>    each)
> -64-bit platform

What OS?

> (PostgreSQL 8.1.3)

Why use such an antiquated, buggy version?  Newer versions are
faster.

-Kevin


I'd agree with trying to use the latest version you can.

How are you loading this data?  I'd make sure you haven't got any indices, primary keys, triggers or constraints on your tables before you begin the initial load, just add them after.  Also use either the COPY command for loading, or prepared transactions.  Individual insert commands will just take way too long.
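For instance, a minimal sketch of the drop-then-recreate approach (the index, column and table names here are hypothetical):

  # drop the index before the load...
  psql -U postgres -d somedatabase -c "DROP INDEX mytable_col_idx"
  # ...run the load...
  psql -U postgres -d somedatabase -f sql_file.sql
  # ...and recreate it afterwards
  psql -U postgres -d somedatabase -c "CREATE INDEX mytable_col_idx ON mytable (col)"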

Regards

Thom

Re: Load experimentation

From
Ben Brehmer
Date:
Kevin,

This is running on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

Ben

On 07/12/2009 10:33 AM, Kevin Grittner wrote:
> Ben Brehmer<benbrehmer@gmail.com>  wrote:
>
>
>> -7.5 GB memory
>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>     each)
>> -64-bit platform
>>
>
> What OS?
>
>
>> (PostgreSQL 8.1.3)
>>
>
> Why use such an antiquated, buggy version?  Newer versions are
> faster.
>
> -Kevin
>
>

Re: Load experimentation

From
Craig James
Date:
Ben Brehmer wrote:
> Hello All,
>
> I'm in the process of loading a massive amount of data (500 GB). After
> some initial timings, I'm looking at 260 hours to load the entire 500GB.

You don't say how you are loading the data, so there's not much to go on.  But generally, there are two primary ways to
speed things up:

1. Group MANY inserts into a single transaction.  If you're doing a row-at-a-time, it will be very slow.  The "sweet
spot" seems to be somewhere between 100 and 1000 inserts in a single transaction.  Below 100, you're still slowing
things down; above 1000, it probably won't make much difference.

2. Use the COPY command.  This requires you to format your data into the form that COPY uses.  But it's VERY fast.
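A minimal sketch of both options, with placeholder table names and values:

  # option 1: many inserts in one transaction
  psql -U postgres -d somedatabase <<'EOF'
  BEGIN;
  INSERT INTO mytable VALUES (1, 'aaa');
  INSERT INTO mytable VALUES (2, 'bbb');
  -- ... a few hundred more INSERTs ...
  COMMIT;
  EOF

  # option 2: COPY, here fed CSV rows on stdin
  psql -U postgres -d somedatabase <<'EOF'
  COPY mytable FROM stdin WITH CSV;
  1,aaa
  2,bbb
  \.
  EOF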

Craig

> 10 days seems like an awfully long time so I'm searching for ways to
> speed this up. The load is happening in the Amazon cloud (EC2), on a
> m1.large instance:
> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
> -64-bit platform
>
>
> So far I have modified my postgresql.conf  file (PostgreSQL 8.1.3). The
> modifications I have made are as follows:
>
> shared_buffers = 786432
> work_mem = 10240
> maintenance_work_mem = 6291456
> max_fsm_pages = 3000000
> wal_buffers = 2048
> checkpoint_segments = 200
> checkpoint_timeout = 300
> checkpoint_warning = 30
> autovacuum = off
>
>
> There are a variety of instance types available in the Amazon cloud
> (http://aws.amazon.com/ec2/instance-types/), including high memory and
> high CPU. High memory instance types come with 34GB or 68GB of memory.
> High CPU instance types have a lot less memory (7GB max) but up to 8
> virtual cores. I am more than willing to change to any of the other
> instance types.
>
> Also, there is nothing else happening on the loading server. It is
> completely dedicated to the load.
>
> Any advice would be greatly appreciated.
>
> Thanks,
>
> Ben
>


Re: Load experimentation

From
Ben Brehmer
Date:
Thanks for the quick responses. I will respond to all questions in one email:

By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql".  The sql_file.sql contains table creates and insert statements. There are no indexes present nor created during the load.

OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)

PostgreSQL: I will try upgrading to latest version.

COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data was generated (Hadoop/MapReduce).

Transactions: Have started a second load process with chunks of 1000 inserts wrapped in a transaction. It's dropped the load time for 1000 inserts from 1 hour to 7 minutes :)

Disk Setup: Using a single disk Amazon image for the destination (database). Source is coming from an EBS volume. I didn't think there were any disk options in Amazon?


Thanks!

Ben





On 07/12/2009 10:39 AM, Thom Brown wrote:
2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov>
Ben Brehmer <benbrehmer@gmail.com> wrote:

> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>    each)
> -64-bit platform

What OS?

> (PostgreSQL 8.1.3)

Why use such an antiquated, buggy version?  Newer versions are
faster.

-Kevin


I'd agree with trying to use the latest version you can.

How are you loading this data?  I'd make sure you haven't got any indices, primary keys, triggers or constraints on your tables before you begin the initial load, just add them after.  Also use either the COPY command for loading, or prepared transactions.  Individual insert commands will just take way too long.

Regards

Thom

Re: Load experimentation

From
Craig James
Date:
Ben Brehmer wrote:
> Thanks for the quick responses. I will respond to all questions in one
> email:
>
> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f
> sql_file.sql".  The sql_file.sql contains table creates and insert
> statements. There are no indexes present nor created during the load.

Although transactions of over 1000 INSERT statements don't speed things up much, they don't hurt either, especially on
a new system that nobody is using yet.  Since you're loading from big SQL files using psql, just put a "begin;" at the
top of the file and a "commit;" at the bottom.  Unlike Oracle, Postgres even allows CREATE and such to be done inside a
transaction.

And BTW, don't forget to ANALYZE when you're all done.
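For example, without editing the files by hand (using the file name from the earlier post):

  ( echo "BEGIN;"; cat sql_file.sql; echo "COMMIT;"; echo "ANALYZE;" ) \
      | psql -U postgres -d somedatabase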

Craig

>
> OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704
> (Red Hat 4.1.2-44)
>
> PostgreSQL: I will try upgrading to latest version.
>
> COPY command: Unfortunately I'm stuck with INSERTS due to the nature
> this data was generated (Hadoop/MapReduce).
>
> Transactions: Have started a second load process with chunks of 1000
> inserts wrapped in a transaction. Its dropped the load time for 1000
> inserts from 1 Hour to 7 minutes :)
>
> Disk Setup: Using a single disk Amazon image for the destination
> (database). Source is coming from an EBS volume. I didn't think there
> were any disk options in Amazon?
>
>
> Thanks!
>
> Ben
>
>
>
>
>
> On 07/12/2009 10:39 AM, Thom Brown wrote:
>> 2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov
>> <mailto:Kevin.Grittner@wicourts.gov>>
>>
>>     Ben Brehmer <benbrehmer@gmail.com <mailto:benbrehmer@gmail.com>>
>>     wrote:
>>
>>     > -7.5 GB memory
>>     > -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>     >    each)
>>     > -64-bit platform
>>
>>     What OS?
>>
>>     > (PostgreSQL 8.1.3)
>>
>>     Why use such an antiquated, buggy version?  Newer versions are
>>     faster.
>>
>>     -Kevin
>>
>>
>>
>> I'd agree with trying to use the latest version you can.
>>
>> How are you loading this data?  I'd make sure you haven't got any
>> indices, primary keys, triggers or constraints on your tables before
>> you begin the initial load, just add them after.  Also use either the
>> COPY command for loading, or prepared transactions.  Individual insert
>> commands will just take way too long.
>>
>> Regards
>>
>> Thom


Re: Load experimentation

From
Alan Hodgson
Date:
On Monday 07 December 2009, Ben Brehmer <benbrehmer@gmail.com> wrote:
> Disk Setup: Using a single disk Amazon image for the destination
> (database). Source is coming from an EBS volume. I didn't think there
> were any disk options in Amazon?

I don't think any Amazon cloud service is particularly well suited to a
database. Combine virtualized hosts with terrible I/O, and it's actually
hard to envision a worse place to run a database.

--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."

Re: Load experimentation

From
Greg Smith
Date:
Ben Brehmer wrote:
By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql".  The sql_file.sql contains table creates and insert statements. There are no indexes present nor created during the load.
COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data was generated (Hadoop/MapReduce).
Your basic options here are to batch the INSERTs into bigger chunks, and/or to split your data file up so that it can be loaded by more than one process at a time.  There's some comments and links to more guidance here at http://wiki.postgresql.org/wiki/Bulk_Loading_and_Restores
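For the splitting approach, a rough shell sketch (file names are placeholders):

  # load a few of the ~200 input files at a time
  for f in sql_file_01.sql sql_file_02.sql sql_file_03.sql sql_file_04.sql; do
      psql -U postgres -d somedatabase -f "$f" &
  done
  wait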

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Load experimentation

From
Ben Brehmer
Date:
Thanks for all the responses. I have one more thought;

Since my input data is split into about 200 files (3GB each), I could potentially spawn one load command for each file. What would be the maximum number of input connections Postgres can handle without bogging down? When I say 'input connection' I mean "psql -U postgres -d dbname -f one_of_many_sql_files".

Thanks,
Ben



On 07/12/2009 12:59 PM, Greg Smith wrote:
Ben Brehmer wrote:
By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql".  The sql_file.sql contains table creates and insert statements. There are no indexes present nor created during the load.
COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data was generated (Hadoop/MapReduce).
Your basic options here are to batch the INSERTs into bigger chunks, and/or to split your data file up so that it can be loaded by more than one process at a time.  There's some comments and links to more guidance here at http://wiki.postgresql.org/wiki/Bulk_Loading_and_Restores

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com 

Re: Load experimentation

From
Greg Smith
Date:
Ben Brehmer wrote:
Since my input data is split into about 200 files (3GB each), I could potentially spawn one load command for each file. What would be the maximum number of input connections Postgres can handle without bogging down?
You can expect to easily get one loader process per real CPU going.  Beyond that, it depends on how CPU intensive they all are and what the resulting I/O rate out of the combination is.  You're probably going to run out of CPU on a loading job long before you hit any of the other limits in this area, and potentially you could run out of disk throughput on a cloud system before that.  PostgreSQL isn't going to bog down on a connection basis until you've reached several hundred of them; your loader will be lucky to hit 10 active processes before it grinds to a halt on some physical resource unrelated to general database scaling.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Load experimentation

From
Scott Marlowe
Date:
On Tue, Dec 8, 2009 at 12:22 AM, Ben Brehmer <benbrehmer@gmail.com> wrote:
> Thanks for all the responses. I have one more thought;
>
> Since my input data is split into about 200 files (3GB each), I could
> potentially spawn one load command for each file. What would be the maximum
> number of input connections Postgres can handle without bogging down? When I
> say 'input connection' I mean "psql -U postgres -d dbname -f
> one_of_many_sql_files".

This is VERY dependent on your IO capacity and number of cores.  My
experience is that unless you're running on a decent number of disks,
you'll run out of IO capacity first in most machines.  n pairs of
mirrors in a RAID-10 can handle x input threads where x has some near
linear relation to n.  Have 100 disks in a RAID-10 array?  You can
surely handle dozens of load threads with no IO wait.  Have 4 disks in
a RAID-10?  Maybe two to four load threads will max you out.  Once
you're IO bound, adding more threads and more CPUs won't help, it'll
hurt.  The only way to really know is to benchmark it, but I'd guess
that about half as many import threads as mirror pairs in a RAID-10
(or just drives if you're using RAID-0) would be a good place to start
and work from there.

Re: Load experimentation

From
Scott Marlowe
Date:
On Tue, Dec 8, 2009 at 12:58 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Tue, Dec 8, 2009 at 12:22 AM, Ben Brehmer <benbrehmer@gmail.com> wrote:
>> Thanks for all the responses. I have one more thought;
>>
>> Since my input data is split into about 200 files (3GB each), I could
>> potentially spawn one load command for each file. What would be the maximum
>> number of input connections Postgres can handle without bogging down? When I
>> say 'input connection' I mean "psql -U postgres -d dbname -f
>> one_of_many_sql_files".
>
> This is VERY dependent on your IO capacity and number of cores.  My
> experience is that unless you're running on a decent number of disks,
> you'll run out of IO capacity first in most machines.  n pairs of
> mirrors in a RAID-10 can handle x input threads where x has some near
> linear relation to n.  Have 100 disks in a RAID-10 array?  You can
> surely handle dozens of load threads with no IO wait.  Have 4 disks in
> a RAID-10?  Maybe two to four load threads will max you out.  Once
> you're IO bound, adding more threads and more CPUs won't help, it'll
> hurt.  The only way to really know is to benchmark it, but i'd guess
> that about half as many import threads as mirror pairs in a RAID-10
> (or just drives if you're using RAID-0) would be a good place to start
> and work from there.

Note that if you start running out of CPU horsepower first, the
degradation will be less harsh as you go past the knee in the
performance curve.  IO has a sharper knee than CPU.

Re: Load experimentation

From
Dimitri Fontaine
Date:
Hi,

Ben Brehmer <benbrehmer@gmail.com> writes:
> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql".  The sql_file.sql contains table
createsand insert statements. There are no 
> indexes present nor created during the load.
>
> OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
>
> PostgreSQL: I will try upgrading to latest version.
>
> COPY command: Unfortunately I'm stuck with INSERTS due to the nature
> this data was generated (Hadoop/MapReduce).

What I think you could do is the following:

 - switch to using 8.4
 - load your files in a *local* database
 - pg_dump -Fc
 - now pg_restore -j X on the amazon setup

That way you will be using COPY rather than INSERTs, plus the parallel
loading built into pg_restore (and its optimisations of when to add the
indexes and constraints). The X is to be chosen depending on the IO power
and the number of CPUs...
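Roughly (database and dump file names are placeholders, and -j 4 is just an example value):

  # 1. load the SQL files into a local 8.4 database first
  psql -U postgres -d localdb -f sql_file.sql
  # 2. dump it in custom format
  pg_dump -Fc -U postgres -f localdb.dump localdb
  # 3. copy the dump to the EC2 instance, then restore with X parallel jobs
  pg_restore -j 4 -U postgres -d somedatabase localdb.dump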

Regards,
--
dim

Re: Load experimentation

From
Scott Marlowe
Date:
On Tue, Dec 8, 2009 at 2:08 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Hi,
>
> Ben Brehmer <benbrehmer@gmail.com> writes:
>> By "Loading data" I am implying: "psql -U postgres -d somedatabase -f sql_file.sql".  The sql_file.sql contains
tablecreates and insert statements. There are no 
>> indexes present nor created during the load.
>>
>> OS: x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
>>
>> PostgreSQL: I will try upgrading to latest version.
>>
>> COPY command: Unfortunately I'm stuck with INSERTS due to the nature
>> this data was generated (Hadoop/MapReduce).
>
> What I think you could do is the followings:
>
>  - switch to using 8.4
>  - load your files in a *local* database
>  - pg_dump -Fc
>  - now pg_restore -j X on the amazon setup
>
> That way you will be using COPY rather than INSERTs and parallel loading
> built-in pg_restore (and optimisations of when to add the indexes and
> constraints). The X is to choose depending on the IO power and the
> number of CPU...

That's a lot of work to get to COPY.  It might be enough to drop all
FK relations and indexes on the destination db in the cloud, load the
data in a few (or one) transaction(s), then recreate indexes and FK
relationships.

Re: Load experimentation

From
Dimitri Fontaine
Date:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> That's a lot of work to get to COPY.

Well, yes. I thought about it this way only after having read that the OP is
uneasy with producing another format from his source data, and
considering it's a one-shot operation.

Ah, tradeoffs, how to find the right one!
--
dim

Re: Load experimentation

From
Andy Colson
Date:
On 12/07/2009 12:12 PM, Ben Brehmer wrote:
> Hello All,
>
> I'm in the process of loading a massive amount of data (500 GB). After
> some initial timings, I'm looking at 260 hours to load the entire 500GB.
> 10 days seems like an awfully long time so I'm searching for ways to
> speed this up. The load is happening in the Amazon cloud (EC2), on a
> m1.large instance:
> -7.5 GB memory
> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
> -64-bit platform
>
>
> So far I have modified my postgresql.conf file (PostgreSQL 8.1.3). The
> modifications I have made are as follows:
>
> shared_buffers = 786432
> work_mem = 10240
> maintenance_work_mem = 6291456
> max_fsm_pages = 3000000
> wal_buffers = 2048
> checkpoint_segments = 200
> checkpoint_timeout = 300
> checkpoint_warning = 30
> autovacuum = off
>
>
> There are a variety of instance types available in the Amazon cloud
> (http://aws.amazon.com/ec2/instance-types/), including high memory and
> high CPU. High memory instance types come with 34GB or 68GB of memory.
> High CPU instance types have a lot less memory (7GB max) but up to 8
> virtual cores. I am more than willing to change to any of the other
> instance types.
>
> Also, there is nothing else happening on the loading server. It is
> completely dedicated to the load.
>
> Any advice would be greatly appreciated.
>
> Thanks,
>
> Ben
>

I'm kind of curious, how goes the load?  Is it done yet?  Still looking at days'n'days to finish?

I was thinking... If the .sql files are really nicely formatted, it would not be too hard to whip up a perl script to
run as a filter to change the statements into copy's.

Each file would have to only fill one table, and only contain inserts, and all the insert statements would have to set
the same fields.  (And I'm sure there could be other problems).

Also, just for the load, did you disable fsync?

-Andy

Re: Load experimentation

From
Ben Brehmer
Date:
Hi Andy,

Load is chugging along. We've optimized our postgres conf as much as
possible but are seeing the inevitable I/O bottleneck. I had the same
thought as you (converting inserts into copy's) a while back but
unfortunately each file has many inserts into many different tables.
Potentially I could rip through this with a little MapReduce job on
50-100 nodes, which is still something I might do.

One thought we are playing with is taking advantage of 4 x 414GB EBS
devices in a RAID0 configuration. This would spread disk writes across 4
block devices.
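Roughly, the mdadm side of that would look something like this (the device names and mount point are just guesses for how the EBS volumes attach):

  # stripe the four EBS volumes into one device
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  mkfs.ext3 /dev/md0
  mount /dev/md0 /pgdata   # and point the data directory at it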

Right now I'm wrapping about 1500 inserts in a transaction block. Since
it's an I/O bottleneck, COPY statements might not give me much advantage.

It's definitely a work in progress :)

Ben


On 09/12/2009 5:31 AM, Andy Colson wrote:
> On 12/07/2009 12:12 PM, Ben Brehmer wrote:
>> Hello All,
>>
>> I'm in the process of loading a massive amount of data (500 GB). After
>> some initial timings, I'm looking at 260 hours to load the entire 500GB.
>> 10 days seems like an awfully long time so I'm searching for ways to
>> speed this up. The load is happening in the Amazon cloud (EC2), on a
>> m1.large instance:
>> -7.5 GB memory
>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
>> -64-bit platform
>>
>>
>> So far I have modified my postgresql.conf file (PostgreSQL 8.1.3). The
>> modifications I have made are as follows:
>>
>> shared_buffers = 786432
>> work_mem = 10240
>> maintenance_work_mem = 6291456
>> max_fsm_pages = 3000000
>> wal_buffers = 2048
>> checkpoint_segments = 200
>> checkpoint_timeout = 300
>> checkpoint_warning = 30
>> autovacuum = off
>>
>>
>> There are a variety of instance types available in the Amazon cloud
>> (http://aws.amazon.com/ec2/instance-types/), including high memory and
>> high CPU. High memory instance types come with 34GB or 68GB of memory.
>> High CPU instance types have a lot less memory (7GB max) but up to 8
>> virtual cores. I am more than willing to change to any of the other
>> instance types.
>>
>> Also, there is nothing else happening on the loading server. It is
>> completely dedicated to the load.
>>
>> Any advice would be greatly appreciated.
>>
>> Thanks,
>>
>> Ben
>>
>
> I'm kind of curious, how goes the load?  Is it done yet?  Still
> looking at days'n'days to finish?
>
> I was thinking... If the .sql files are really nicely formatted, it
> would not be too hard to whip up a perl script to run as a filter to
> change the statements into copy's.
>
> Each file would have to only fill one table, and only contain inserts,
> and all the insert statements would have to set the same fields.  (And
> I'm sure there could be other problems).
>
> Also, just for the load, did you disable fsync?
>
> -Andy
>

Re: Load experimentation

From
Scott Carey
Date:
On 12/7/09 11:12 AM, "Ben Brehmer" <benbrehmer@gmail.com> wrote:

> Thanks for the quick responses. I will respond to all questions in one email:
>
> COPY command: Unfortunately I'm stuck with INSERTS due to the nature this data
> was generated (Hadoop/MapReduce).

If you have control over the MapReduce output, you can have it output
result files in a format that COPY likes.

If you don't have any control over that, it's more complicated.  I use a final
pass Hadoop Map-only job to go over the output and insert into postgres
directly from the job, using the

INSERT INTO <table> VALUES (val1, val2, ... ), (val1, val2, ...), ...

insert style from Java, with about 80 rows per insert statement and a single
transaction for about a thousand of these.  This was faster than batch
inserts.
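In SQL terms, each batch looks roughly like this (table name and values are placeholders; note that multi-row VALUES needs 8.2 or later, so not the OP's 8.1.3):

  psql -U postgres -d somedatabase <<'EOF'
  BEGIN;
  INSERT INTO mytable VALUES (1, 'aaa'), (2, 'bbb'), (3, 'ccc');  -- ~80 rows per statement in practice
  INSERT INTO mytable VALUES (4, 'ddd'), (5, 'eee'), (6, 'fff');
  -- ... on the order of a thousand such statements ...
  COMMIT;
  EOF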


>
> On 07/12/2009 10:39 AM, Thom Brown wrote:
>>
>> 2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov>
>>
>>>
>>> Ben Brehmer <benbrehmer@gmail.com> wrote:
>>>
>>>> -7.5 GB memory
>>>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>>>    each)
>>>> -64-bit platform
>>>
>>>
>>> What OS?
>>>
>>>> (PostgreSQL 8.1.3)
>>>
>>> Why use such an antiquated, buggy version?  Newer versions are
>>> faster.
>>>
>>> -Kevin
>>>
>>
>>
>>
>>
>>
>>
>> I'd agree with trying to use the latest version you can.
>>
>>
>>
>>
>> How are you loading this data?  I'd make sure you haven't got any indices,
>> primary keys, triggers or constraints on your tables before you begin the
>> initial load, just add them after.  Also use either the COPY command for
>> loading, or prepared transactions.  Individual insert commands will just take
>> way too long.
>>
>>
>>
>>
>> Regards
>>
>>
>>
>>
>> Thom
>


Re: Load experimentation

From
Scott Carey
Date:


On 12/10/09 3:29 PM, "Scott Carey" <scott@richrelevance.com> wrote:

> On 12/7/09 11:12 AM, "Ben Brehmer" <benbrehmer@gmail.com> wrote:
>
>> Thanks for the quick responses. I will respond to all questions in one email:
>>
>> COPY command: Unfortunately I'm stuck with INSERTS due to the nature this
>> data
>> was generated (Hadoop/MapReduce).
>
> If you have control over the MapReduce output, you can have that output
> result files in a format that COPY likes.
>
> If you don't have any control over that its more complicated.  I use a final
> pass Hadoop Map only job to go over the output and insert into postgres
> directly from the job, using the :
>
> INSERT INTO <table> VALUES (val1, val2, ... ), (val1, val2, ...), ...
> Insert style from Java with about 80 rows per insert statement and a single
> transaction for about a thousand of these.  This was faster than batch
> inserts .
>

I should mention that the above is a bit off.  There is an important caveat:
each of these individual tasks might run twice in Hadoop (only one will
finish -- speculative execution and retry on error).  To deal with this you
can run each job inside a single transaction so that a failure will roll
back, and you likely want to turn off speculative execution.

Another option is to run only one map job, with no reduce for this sort of
work in order to ensure duplicate data is not inserted.  We are inserting
into a temp table named uniquely per chunk first (sometimes in parallel).
Then, while holding a postgres advisory lock, we do a SELECT * FROM <temp>
INTO <destination> type operation, which is fast.
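Schematically, per chunk, that works out to something like this (names are placeholders, and the final move is shown here as an INSERT ... SELECT):

  psql -U postgres -d somedatabase <<'EOF'
  -- each chunk gets its own uniquely named staging table
  CREATE TABLE staging_chunk_0042 (LIKE destination);
  -- ... bulk-load the chunk's rows into staging_chunk_0042 here ...
  -- serialize the final move with an advisory lock
  SELECT pg_advisory_lock(1);
  INSERT INTO destination SELECT * FROM staging_chunk_0042;
  SELECT pg_advisory_unlock(1);
  DROP TABLE staging_chunk_0042;
  EOF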

>
>>
>> On 07/12/2009 10:39 AM, Thom Brown wrote:
>>>
>>> 2009/12/7 Kevin Grittner <Kevin.Grittner@wicourts.gov>
>>>
>>>>
>>>> Ben Brehmer <benbrehmer@gmail.com> wrote:
>>>>
>>>>> -7.5 GB memory
>>>>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>>>>    each)
>>>>> -64-bit platform
>>>>
>>>>
>>>> What OS?
>>>>
>>>>> (PostgreSQL 8.1.3)
>>>>
>>>> Why use such an antiquated, buggy version?  Newer versions are
>>>> faster.
>>>>
>>>> -Kevin
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I'd agree with trying to use the latest version you can.
>>>
>>>
>>>
>>>
>>> How are you loading this data?  I'd make sure you haven't got any indices,
>>> primary keys, triggers or constraints on your tables before you begin the
>>> initial load, just add them after.  Also use either the COPY command for
>>> loading, or prepared transactions.  Individual insert commands will just
>>> take
>>> way too long.
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>>
>>> Thom
>>
>
>