Thread: Getting even more insert performance (250m+rows/day)

Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
I have a system that currently inserts ~250 million rows per day (I
have about 10x more raw data than that, but I'm at the limit of my
ability to get useful insert performance out of postgres).

Things I've already done that have made a big difference:
- modified postgresql.conf shared_buffers value
- converted to COPY from individual insert statements
- changed BLCKSZ to 32768

I currently get ~35k/sec inserts on a table with one index (~70k/sec
inserts if I don't have any indexes).
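
For context, a quick back-of-the-envelope on these figures (all numbers are taken from the thread, including the 5-minute batch interval Daniel describes later):

```python
# Back-of-the-envelope check of the figures in this thread:
# 250M rows/day, one COPY batch every 5 minutes, ~35k rows/sec with one index.
rows_per_day = 250_000_000
batches_per_day = 24 * 60 // 5        # one batch every 5 minutes -> 288/day
rows_per_batch = rows_per_day // batches_per_day
insert_rate = 35_000                  # rows/sec with one index

batch_seconds = rows_per_batch / insert_rate
print(f"{rows_per_batch} rows per batch, "
      f"~{batch_seconds:.0f}s of each 300-second window spent inserting")
```

So each 5-minute window spends roughly 25 seconds inserting at the single-index rate, which is why halving that rate with a second index is painful but not immediately fatal.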

The indexed field is basically a time_t (seconds since the epoch),
autovacuum is running (or postgres would stop choosing to use the
index). The other fields have relatively lower cardinality.

Each day's worth of data gets inserted into its own table so that I
can expire the data without too much effort (since drop table is much
faster than running a delete and then vacuum).
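
That per-day rotation might look roughly like this (table and column names here are hypothetical, not from the post):

```sql
-- Hypothetical sketch of the per-day rotation described above: one table
-- per day, so expiry is a cheap DROP TABLE instead of DELETE plus VACUUM.
CREATE TABLE events_20060524 (
    ts     integer NOT NULL,  -- time_t, seconds since the epoch
    source integer,
    value  integer
);
CREATE INDEX events_20060524_ts_idx ON events_20060524 (ts);

-- When a day ages out of the retention window:
DROP TABLE events_20060517;
```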

I would really like to be able to have 1 (or 2) more indexes on the
table, since it takes a while for a sequential scan of 250 million rows
to complete, but with each extra index the insert CPU time goes way up.

In fact, it looks like I'm not currently IO bound, but CPU-bound. I
think some sort of lazy-index generation (especially if it could be
parallelized to use the other processors/cores that currently sit
mostly idle) would be a solution. Is anyone working on something like
this? Any other ideas? Where should I look if I want to start to
think about creating a new index that would work this way (or am I
just crazy)?

Thanks for any insight!

--
Daniel J. Luke
+========================================================+
| *---------------- dluke@geeklair.net ----------------* |
| *-------------- http://www.geeklair.net -------------* |
+========================================================+
|   Opinions expressed are mine and do not necessarily   |
|          reflect the opinions of my employer.          |
+========================================================+




Re: Getting even more insert performance (250m+rows/day)

From
"Dave Dutcher"
Date:
If you can live with possible database corruption, you could try turning
fsync off.  For example, if you could simply reinsert the data in the
event that a hardware failure corrupts the database, you might get a
decent improvement.

Also, have you tried creating the index after you have inserted all your
data?  (Or maybe COPY already disables the indexes while inserting?)
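
For what it's worth, COPY does not disable indexes by itself; Dave's drop-and-rebuild suggestion would look something like this (index name and file path hypothetical):

```sql
-- Hypothetical sketch: build the index once after the bulk load instead
-- of maintaining it row by row during COPY.
DROP INDEX events_20060524_ts_idx;
COPY events_20060524 FROM '/path/to/batch.dat';
CREATE INDEX events_20060524_ts_idx ON events_20060524 (ts);
```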



> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Daniel J. Luke
> Sent: Wednesday, May 24, 2006 2:45 PM
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] Getting even more insert performance
> (250m+rows/day)
>
>
> I have a system that currently inserts ~ 250 million rows per day (I
> have about 10x more raw data than that, but I'm at the limit of my
> ability to get useful insert performance out of postgres).
>
> Things I've already done that have made a big difference:
> - modified postgresql.conf shared_buffers value
> - converted to COPY from individual insert statements
> - changed BLCKSZ to 32768
>
> I currently get ~35k/sec inserts on a table with one index (~70k/sec
> inserts if I don't have any indexes).
>
> The indexed field is basically a time_t (seconds since the epoch),
> autovacuum is running (or postgres would stop choosing to use the
> index). The other fields have relatively lower cardinality.
>
> Each day's worth of data gets inserted into its own table so that I
> can expire the data without too much effort (since drop table
> is much
> faster than running a delete and then vacuum).
>
> I would really like to be able to have 1 (or 2) more indexes on the
> table since it takes a while for a sequential scan of
> 250million rows
> to complete, but CPU time goes way up.
>
> In fact, it looks like I'm not currently IO bound, but CPU-bound. I
> think some sort of lazy-index generation (especially if it could be
> parallelized to use the other processors/cores that currently sit
> mostly idle) would be a solution. Is anyone working on
> something like
> this? Any other ideas? Where should I look if I want to start to
> think about creating a new index that would work this way (or am I
> just crazy)?
>
> Thanks for any insight!
>
> --
> Daniel J. Luke
>
>
>


Re: Getting even more insert performance (250m+rows/day)

From
"Steinar H. Gunderson"
Date:
On Wed, May 24, 2006 at 03:45:17PM -0400, Daniel J. Luke wrote:
> Things I've already done that have made a big difference:
> - modified postgresql.conf shared_buffers value
> - converted to COPY from individual insert statements
> - changed BLCKSZ to 32768

Have you tried fiddling with the checkpointing settings? Check your logs --
if you get a warning about checkpoints being too close together, that should
give you quite some boost.
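
For reference, the knobs Steinar is talking about live in postgresql.conf; the values below are only illustrative for an 8.1-era system, not recommendations:

```
# postgresql.conf -- illustrative 8.1-era values, not recommendations
checkpoint_segments = 64   # default 3; more WAL segments between forced checkpoints
checkpoint_timeout  = 900  # seconds between automatic checkpoints (default 300)
checkpoint_warning  = 30   # seconds; warn if checkpoints come closer than this
```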

Apart from that, you should have quite a bit to go on -- somebody on this
list reported 2 billion rows/day earlier, but it might have been on beefier
hardware, of course. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 24, 2006, at 4:02 PM, Dave Dutcher wrote:
> If you can live with possible database corruption, you could try
> turning
> Fsync off.  For example if you could just reinsert the data on the off
> chance a hardware failure corrupts the database, you might get a
> decent
> improvement.

I tried, but I didn't see much of an improvement (and it's not really
acceptable for this application).

> Also have you tried creating the index after you have inserted all
> your
> data?  (Or maybe copy already disables the indexes while inserting?)

The data gets inserted in batches every 5 minutes and I potentially
have people querying it constantly, so I can't remove and re-create
the index.

--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 24, 2006, at 4:03 PM, Steinar H. Gunderson wrote:
> Have you tried fiddling with the checkpointing settings? Check your
> logs --
> if you get a warning about checkpoints being too close together,
> that should
> give you quite some boost.

no warnings in the log (I did change the checkpoint settings when I
set up the database, but didn't notice an appreciable difference in
insert performance).

> Apart from that, you should have quite a bit to go on -- somebody
> on this
> list reported 2 billion rows/day earlier, but it might have been on
> beefier
> hardware, of course. :-)

Probably :) I'll keep searching the list archives and see if I find
anything else (I did some searching and didn't find anything that I
hadn't already tried).

Thanks!

--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Larry Rosenman"
Date:
Daniel J. Luke wrote:
> On May 24, 2006, at 4:02 PM, Dave Dutcher wrote:
>> If you can live with possible database corruption, you could try
>> turning Fsync off.  For example if you could just reinsert the data
>> on the off chance a hardware failure corrupts the database, you
>> might get a decent improvement.
>
> I tried, but I didn't see much of an improvement (and it's not really
> acceptable for this application).
>
>> Also have you tried creating the index after you have inserted all
>> your data?  (Or maybe copy already disables the indexes while
>> inserting?)
>
> The data gets inserted in batches every 5 minutes and I potentially
> have people querying it constantly, so I can't remove and re-create
> the index.
>
Are the batches single inserts, or wrapped in one big transaction?

I.e., do the inserts look like:
INSERT
INSERT
INSERT

or

BEGIN
INSERT
INSERT
INSERT
COMMIT

If the former, the latter is a big win.

Also, what release(s) are you running?

LER

--
Larry Rosenman
Database Support Engineer

PERVASIVE SOFTWARE, INC.
12365B RIATA TRACE PKWY
3015
AUSTIN TX  78727-6531

Tel: 512.231.6173
Fax: 512.231.6597
Email: Larry.Rosenman@pervasive.com
Web: www.pervasive.com

Re: Getting even more insert performance (250m+rows/day)

From
"Steinar H. Gunderson"
Date:
On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
> no warnings in the log (I did change the checkpoint settings when I
> set up the database, but didn't notice an appreciable difference in
> insert performance).

How about wal_buffers? Upping it might not help all that much if only one
thread is writing, but you might give it a try...
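
That's a one-line change in postgresql.conf; in 8.1 the default is 8 and each buffer is one 8 kB page (the value below is only an example):

```
# postgresql.conf -- example only
wal_buffers = 64   # default 8; each buffer is 8 kB in a stock build
```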

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Getting even more insert performance (250m+rows/day)

From
Mark Lewis
Date:
> The data gets inserted in batches every 5 minutes and I potentially
> have people querying it constantly, so I can't remove and re-create
> the index.

How live does your data need to be?  One possibility would be to use a
separate table for each batch instead of a separate table per day,
create the indexes after the import and only after the indexes have been
created make the table available for user queries.

You'd be trading latency for throughput in that case.
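
Mark's load-then-publish pattern might be sketched like this (all names hypothetical):

```sql
-- Hypothetical sketch: COPY into a private table, index it, then expose
-- it to readers in one cheap catalog operation.
CREATE TABLE events_batch_0042 (
    ts     integer NOT NULL,
    source integer,
    value  integer
);
COPY events_batch_0042 FROM STDIN;
CREATE INDEX events_batch_0042_ts_idx ON events_batch_0042 (ts);
-- Publish: rename into place (or add it to a view over recent batches).
ALTER TABLE events_batch_0042 RENAME TO events_current;
```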

Also, you mentioned that you're CPU-bound, but that you have multiple
CPU's.  In that case, performing N concurrent imports (where N is the
number of processor cores available) might be a win over a single-
threaded import.

-- Mark Lewis

Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 24, 2006, at 4:13 PM, Steinar H. Gunderson wrote:
> On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
>> no warnings in the log (I did change the checkpoint settings when I
>> set up the database, but didn't notice an appreciable difference in
>> insert performance).
>
> How about wal_buffers? Upping it might not help all that much if
> only one
> thread is writing, but you might give it a try...

I tried, but I didn't notice a difference.

I should probably emphasize that I appear to be CPU bound (and I can
double my # of rows inserted per second by removing the index on the
table, or halve it by adding another index).

I really should run gprof just to verify.

--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 24, 2006, at 4:12 PM, Larry Rosenman wrote:
> are the batches single insert's, or within a big transaction?
> If the former, the latter is a big win.

One big transaction every 5 minutes using 'COPY FROM' (instead of
inserts).

> Also, what release(s) are you running?

8.1.x (I think we're upgrading from 8.1.3 to 8.1.4 today).

--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Larry Rosenman"
Date:
Daniel J. Luke wrote:
> On May 24, 2006, at 4:12 PM, Larry Rosenman wrote:
>> are the batches single insert's, or within a big transaction?
>> If the former, the latter is a big win.
>
> One big transaction every 5 minutes using 'COPY FROM' (instead of
> inserts).
>
>> Also, what release(s) are you running?
>
> 8.1.x (I think we're upgrading from 8.1.3 to 8.1.4 today).
>
Had to ask :)

Also, is pg_xlog on the same or different spindles from the rest of the
PG Data directory?

LER

--
Larry Rosenman
Database Support Engineer

PERVASIVE SOFTWARE, INC.
12365B RIATA TRACE PKWY
3015
AUSTIN TX  78727-6531

Tel: 512.231.6173
Fax: 512.231.6597
Email: Larry.Rosenman@pervasive.com
Web: www.pervasive.com

Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 24, 2006, at 4:24 PM, Larry Rosenman wrote:
> Also, is pg_xlog on the same or different spindles from the rest of
> the
> PG Data directory?

It's sitting on the same disk array (but I'm doing 1 transaction
every 5 minutes, and I'm not near the array's sustained write
capacity, so I don't think that's currently limiting performance).

--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Jim C. Nasby"
Date:
On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
> On May 24, 2006, at 4:03 PM, Steinar H. Gunderson wrote:
> >Have you tried fiddling with the checkpointing settings? Check your
> >logs --
> >if you get a warning about checkpoints being too close together,
> >that should
> >give you quite some boost.
>
> no warnings in the log (I did change the checkpoint settings when I
> set up the database, but didn't notice an appreciable difference in
> insert performance).

Keep in mind that the default warning time of 30 seconds is pretty
conservative; you'd want to bump that up to 300 seconds or so, probably.

As for the suggestion of multiple insert runs at a time, I suspect that
would just result in a lot of contention for some mutex/semaphore in the
index.

Your best bet really is to run gprof and post those results. It's also
possible that this is fixed by a recent patch to HEAD that reduces the
amount of traffic on the index metapage, something gprof would probably
confirm.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Getting even more insert performance (250m+rows/day)

From
"Ian Westmacott"
Date:
We were able to achieve 2B (small) rows per day sustained with
very little latency.  It is beefy hardware, but things that did
help include WAL on its own I/O channel, XFS, binary copy,
and tuning bgwriter and checkpoint settings for the application
and hardware.  Things that didn't help much were shared_buffers
and wal_buffers.  But our application is single-writer, with a
small number of readers.

Although there is tons of great advice in this and other forums,
I think you just have to do a lot of experimentation with careful
measurement to find what's right for your application/environment;
i.e., YMMV.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Steinar H.
Gunderson
Sent: Wednesday, May 24, 2006 4:04 PM
To: Daniel J. Luke
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Getting even more insert performance
(250m+rows/day)


On Wed, May 24, 2006 at 03:45:17PM -0400, Daniel J. Luke wrote:
> Things I've already done that have made a big difference:
> - modified postgresql.conf shared_buffers value
> - converted to COPY from individual insert statements
> - changed BLCKSZ to 32768

Have you tried fiddling with the checkpointing settings? Check your logs --
if you get a warning about checkpoints being too close together, that should
give you quite some boost.

Apart from that, you should have quite a bit to go on -- somebody on this
list reported 2 billion rows/day earlier, but it might have been on beefier
hardware, of course. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/



Re: Getting even more insert performance (250m+rows/day)

From
Markus Schaber
Date:
Hi, Daniel,

Daniel J. Luke wrote:

> One big transaction every 5 minutes using 'COPY FROM' (instead of
> inserts).

Are you using "COPY table FROM '/path/to/file'", with the file sitting
on the server, or "COPY table FROM STDIN" or psql's "\copy", with the
file sitting on the client?

In our tests, having the file on the server sped things up by a
factor of 2 or 3 in some cases.

Also, using BINARY copy may give great benefits due to lower parsing
overhead.

As you say you're not I/O-bound, spreading tables, indices, WAL and the
input file across different spindles won't help you much.
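
The variants Markus contrasts, side by side (path and table name hypothetical; note that BINARY input must already be in PostgreSQL's binary COPY format):

```sql
-- Server-side file: the backend reads the file directly (superuser only;
-- the file must be readable by the server process).
COPY events FROM '/var/tmp/batch.dat';

-- Client-side: psql streams the file over the connection:
--   \copy events FROM '/var/tmp/batch.dat'

-- Binary copy: lower parsing overhead.
COPY events FROM '/var/tmp/batch.bin' WITH BINARY;
```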


HTH
Markus



--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 29, 2006, at 7:11 AM, Markus Schaber wrote:
>> One big transaction every 5 minutes using 'COPY FROM' (instead of
>> inserts).
>
> Are you using "COPY table FROM '/path/to/file'", having the file
> sitting
> on the server, or "COPY table FROM STDIN" or psql "/copy", having the
> file sitting on the client?

COPY table FROM STDIN using psql on the server

I should have gprof numbers on a similarly set up test machine soon ...
--
Daniel J. Luke




Re: Getting even more insert performance (250m+rows/day)

From
"Daniel J. Luke"
Date:
On May 30, 2006, at 3:59 PM, Daniel J. Luke wrote:
> I should have gprof numbers on a similarly set up test machine
> soon ...

gprof output is available at http://geeklair.net/~dluke/postgres_profiles/

(generated from CVS HEAD as of today).

Any ideas are welcome.

Thanks!
--
Daniel J. Luke


