Thread: Migration study, step 1: bulk write performance optimization

Migration study, step 1: bulk write performance optimization

From
"Mikael Carneholm"
Date:
Ok, here's the deal:

I am responsible for an exciting project of evaluating migration of a medium/large application for a well-known Swedish
car & truck manufacturer from a proprietary DB to Postgres. The size of the database is currently about 50Gb, annual
growth depending on sales, but probably in the 30-50Gb range.

Migrating the schema was easily done, mostly involving a search/replace of some vendor-specific datatypes. The next
step is to migrate the data itself, and for this we have written a Java app relying on JDBC metadata to map the tables
in the source schema to the target schema. The goal right now is to find the set of parameters that gives as short a
bulk insert time as possible, minimizing downtime while the data itself is migrated.

The machine used for the study is a Dell PE2850, 6GB memory, 1x XEON 3.0GHz/2MB cache, internal SCSI 0+1 raid (currently
4x36GB 10000rpm striped+mirrored; two more 146GB 15000rpm disks will arrive later). Not sure about the brand/model of
the raid controller, so I'll leave that for now. The file system is ext3 (I know, maybe not the optimal choice, but this
is how it was when I got it) with an 8k block size. The OS currently installed is CentOS4.

Until the new disks arrive, the OS itself, pg_xlog and the data all reside on the same disks. When they arrive, I will
probably move the data to the new disks (need two more to get raid 0+1, though) and leave the OS + pg_xlog on the
10000rpm disks. Mounting the 15000rpm data disks with the noatime option (this is safe, right?) and using a 16kb block
size (for read performance) will probably be considered as well.

NOTE: this machine/configuration is NOT what we will be using in production if the study turns out OK; it's just
supposed to work as a development machine in the first phase, whose purpose more or less is to get familiar with
configuring Postgres and see if we can get the application up and running (we will probably use a 64bit platform and
either a FC SAN or internal raid with a battery-backed cache for production use, if all goes well).

The first thing I did when I got the machine was to do a raw dd write test:

# time bash -c "(dd if=/dev/zero of=/opt/bigfile count=1310720 bs=8k && sync)"
1310720+0 records in
1310720+0 records out

real    2m21.438s
user    0m0.998s
sys     0m51.347s

(10*1024)Mb/~141s => ~75.5Mb/s

As a simple benchmark, I created a table without PK/indexes with 1k wide rows:

create table iotest.one_kb_rows
(
        the_col char(1024) not null
);

To fill the table, I use this simple function:

create or replace function iotest.writestress(megs integer) returns void as $$
declare
    char_str char(1024) := repeat('x', 1024);
begin
     for i in 1..megs loop
          for j in 1..1024 loop
              insert into one_kb_rows(the_col) values (char_str);
          end loop;
     end loop;
end;
$$
language plpgsql;

Then, I tested how long it takes to write 10Gb of data to this table:

iotest=> \timing
Timing is on.

iotest=> select writestress((10*1024));
 writestress
-------------

(1 row)

Time: 379971.252 ms

This gives that 10Gb takes ~380s => ~27Mb/s (with fsync=off), compared to the raw dd result (~75.5Mb/s).

I assume this difference is due to:
- simultaneous WAL write activity (assumed: for each byte written to the table, at least one byte is also written to
WAL, in effect: 10Gb data inserted in the table equals 20Gb written to disk)
- lousy test method (it is done using a function => the transaction size is 10Gb, and 10Gb will *not* fit in
wal_buffers :) )
- poor config
- something else?

I have tried to read up as much as possible on Postgres configuration (disk layout, buffer management, WAL sync
methods, etc) and found this post regarding bgwriter tweaking:
http://archives.postgresql.org/pgsql-performance/2006-03/msg00218.php - which explains the bgwriter config below.

All params in postgresql.conf that are not commented out:
---------------------------------------------------------
max_connections = 100
superuser_reserved_connections = 2
shared_buffers = 16000
bgwriter_lru_percent = 20
bgwriter_lru_maxpages = 160
bgwriter_all_percent = 10
bgwriter_all_maxpages = 320
fsync = off
wal_sync_method = open_sync
wal_buffers = 128
checkpoint_segments = 3
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1440
log_line_prefix = '%m: (%u@%d) '
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

fsync can safely be kept off during data migration as we are able to restart the procedure without losing data if
something goes wrong. Increasing checkpoint_segments to 8/16/32 only increased the insert time, so I kept it at the
default. I will increase shared_buffers and effective_cache_size as soon as it's time to tweak read performance, but
for now I'm just focusing on write performance.


Postgres version used:

iotest=> select version();
                                              version
---------------------------------------------------------------------------------------------------
 PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
(1 row)


I want to make sure I have made correct assumptions before I carry on, so comments are welcome.

- Mikael

Re: Migration study, step 1: bulk write performance optimization

From
"Craig A. James"
Date:
Mikael Carneholm wrote:

> I am responisble for an exciting project of evaluating migration of a
> medium/large application for a well-known swedish car&truck manufacturer
> ... The goal right now is to find the set of parameters that gives as
> short bulk insert time as possible, minimizing downtime while the data
> itself is migrated.

If you haven't explored the COPY command yet, check it out.  It is stunningly fast compared to normal INSERT commands.

   http://www.postgresql.org/docs/8.1/static/sql-copy.html

pg_dump and pg_restore make use of the COPY command.  Since you're coming from a different vendor, you'd have to dump
the data into a COPY-compatible set of files yourself.  But it will be worth the effort.
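
For illustration, loading one table from a dump file could look something like this (the path and table names are
just placeholders; COPY ... FROM reads the file on the server side, while psql's \copy reads it from the client):

-- default text format: tab-delimited columns, \N for NULL
COPY some_schema.some_table FROM '/opt/migration/some_table.dat';
-- or, if the dump files are written as CSV:
COPY some_schema.some_table FROM '/opt/migration/some_table.csv' WITH CSV;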

Craig

Re: Migration study, step 1: bulk write performance

From
Csaba Nagy
Date:
Mikael,

I've just recently been through such an experience, i.e. migrating a DB of
about the same size category as yours from another vendor to postgres.

I think you got it right with the fsync turned off during migration
(just don't forget to turn it back on after finishing ;-), and using tables
without indexes/foreign keys. In our case recreating all the
indexes/foreign keys/other constraints actually took longer than the raw
data transfer itself... but it's possible that the process was not tuned
100%; we are still learning how to tune postgres...

What I can add from our experience: ext3 turned out lousy for our
application, and converting to XFS made quite a big improvement for our
DB load. I don't have hard figures, but I think it was some 30%
improvement in overall speed, and it made a huge difference for heavy
load times... what I mean is that with ext3 we had multiple parallel big
tasks taking more time in total than if we had executed them
sequentially, and with XFS that was gone, load scales linearly. In any
case you should test the performance of your application on different
filesystems and different settings, as this could make a huge difference.

And another thing, we're still fighting with performance problems due to
the fact that our application was designed to perform well with the
other DB product... I think you'll have more work to do in this regard
than just some search/replace ;-)

Cheers,
Csaba.



> using a 16kb block  size (for read performance) will probably be
> considered as well.

    Hm, this means that when postgres wants to write just one 8k page, the OS
will have to read 16k, replace half of it with the new block, and write
16k again... I guess it would be better to stick with the usual block
size. Also, it will have to read 16k every time it really wants to read one
page... which happens quite often except for seq scans.

> NOTE: this machine/configuration is NOT what we will be using in
> production if the study turns out OK, it's just supposed to work as a
> development machine in the first phase whose purpose more or less is to
> get familiar with configurating Postgres and see if we can get the
> application up and running (we will probably use a 64bit platform and

    Opteron xDDD

    Use XFS or Reiser... ext3 isn't well suited for this. Use noatime AND
nodiratime.

    It's safe to turn off fsync while importing your data.
    For optimum speed, put the WAL on another physical disk.

    Look in the docs to see which of maintenance_work_mem, work_mem or sort_mem
is used for index creation, and set it to a very large value to speed up
index creation. Create your indexes with fsync=off as well.
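
    (In 8.1 the setting used for index builds is maintenance_work_mem. A rough
sketch, per session, with a 512MB value; the table, column and index names are
just placeholders:)

SET maintenance_work_mem = 524288;   -- value is in kB in 8.1, i.e. 512MB
CREATE INDEX some_table_some_col_idx ON some_schema.some_table (some_col);
RESET maintenance_work_mem;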



Re: Migration study, step 1: bulk write performance optimization

From
Dave Cramer
Date:
Others are reporting better performance on 8.1.x with very large
shared buffers. You may want to try tweaking that, possibly as high as
20% of available memory.

Dave
On 20-Mar-06, at 9:59 AM, Mikael Carneholm wrote:

> Ok, here's the deal:
>
> I am responsible for an exciting project of evaluating migration of
> a medium/large application for a well-known Swedish car & truck
> manufacturer from a proprietary DB to Postgres.
>
> [rest of the original message snipped]


update == delete + insert?

From
"Craig A. James"
Date:
I've seen it said here several times that "update == delete + insert".  On the other hand, I've noticed that "alter
table [add|drop] column ..." is remarkably fast, even for very large tables, which leads me to wonder whether each
column's contents are in a file specifically for that column.

My question: Suppose I have a very "wide" set of data, say 100 columns, and one of those columns will be updated often,
but the others are fairly static.  I have two choices:

Design 1:
   create table a (
     id integer,
     frequently_updated  integer);

   create table b(
     id integer,
     infrequently_updated_1 integer,
     infrequently_updated_2 integer,
     infrequently_updated_3 integer,
     ... etc.
     infrequently_updated_99 integer);

Design 2:
   create table c(
     id integer,
     frequently_updated  integer,
     infrequently_updated_1 integer,
     infrequently_updated_2 integer,
     infrequently_updated_3 integer,
     ... etc.
     infrequently_updated_99 integer);

If "update == delete + insert" is strictly true, then "Design 2" would be poor since 99 columns would be moved around
witheach update.  But if columns are actually stored in separate files, the Designs 1 and 2 would be essentially
equivalentwhen it comes to vacuuming. 

Thanks,
Craig

Re: update == delete + insert?

From
"Jim Buttafuoco"
Date:
Go with design 1; update does = delete + insert.


---------- Original Message -----------
From: "Craig A. James" <cjames@modgraph-usa.com>
To: pgsql-performance@postgresql.org
Sent: Mon, 20 Mar 2006 14:49:43 -0800
Subject: [PERFORM] update == delete + insert?

> [original message quoted in full; snipped]
------- End of Original Message -------


Re: update == delete + insert?

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> I've seen it said here several times that "update == delete + insert".  On the other hand, I've noticed that "alter
> table [add|drop] column ..." is remarkably fast, even for very large tables, which leads me to wonder whether each
> column's contents are in a file specifically for that column.

No.  The reason "drop column" is fast is that we make no attempt to
remove the data from existing rows; we only mark the column's entry in
the system catalogs as deleted.  "add column" is only fast if you are
adding a column with no default (a/k/a default NULL).  In that case
likewise we don't have to modify existing rows; the desired behavior
falls out from the fact that the tuple access routines return NULL if
asked to fetch a column beyond those existing in a particular tuple.
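
For example (a sketch; the table and column names are just placeholders):

-- fast: only a catalog change, existing rows are not touched
ALTER TABLE some_table ADD COLUMN new_col integer;
-- not fast: a non-NULL default forces every existing row to be rewritten
ALTER TABLE some_table ADD COLUMN new_col2 integer DEFAULT 0;
-- fast: the column is only marked as dropped in the catalogs
ALTER TABLE some_table DROP COLUMN new_col2;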

You can read about the storage layout in
http://developer.postgresql.org/docs/postgres/storage.html

            regards, tom lane

Re: Migration study, step 1: bulk write performance optimization

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> If you haven't explored the COPY command yet, check it out.  It is stunningly fast compared to normal INSERT
commands.

Note also that his "benchmark" is testing multiple INSERTs issued within
a loop in a plpgsql function, which has got nearly nothing to do with
the performance that will be obtained from INSERTs issued by a client
(especially if said INSERTs aren't prepared and/or aren't batched into
transactions).
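
A rough sketch of what "prepared and batched" means at the SQL level, reusing
Mikael's test table (the JDBC equivalent would be a PreparedStatement with
addBatch()/executeBatch() inside a single transaction; the row values here are
just placeholders):

BEGIN;
PREPARE ins(text) AS INSERT INTO iotest.one_kb_rows (the_col) VALUES ($1);
EXECUTE ins(repeat('x', 1024));
EXECUTE ins(repeat('x', 1024));
-- ... many thousands of EXECUTEs per transaction ...
COMMIT;
DEALLOCATE ins;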

            regards, tom lane

Re: update == delete + insert?

From
"Jaime Casanova"
Date:
On 3/20/06, Craig A. James <cjames@modgraph-usa.com> wrote:
> I've seen it said here several times that "update == delete + insert".  On the other hand, I've noticed that "alter
> table [add|drop] column ..." is remarkably fast, even for very large tables, which leads me to wonder whether each
> column's contents are in a file specifically for that column.
>
> My question: Suppose I have a very "wide" set of data, say 100 columns, and one of those columns will be updated
> often, but the others are fairly static.  I have two choices:
>
> Design 1:
>    create table a (
>      id integer,
>      frequently_updated  integer);
>
>    create table b(
>      id integer,
>      infrequently_updated_1 integer,
>      infrequently_updated_2 integer,
>      infrequently_updated_3 integer,
>      ... etc.
>      infrequently_updated_99 integer);
>
> Design 2:
>    create table c(
>      id integer,
>      frequently_updated  integer,
>      infrequently_updated_1 integer,
>      infrequently_updated_2 integer,
>      infrequently_updated_3 integer,
>      ... etc.
>      infrequently_updated_99 integer);
>
> If "update == delete + insert" is strictly true, then "Design 2" would be poor since 99 columns would be moved around
witheach update.  But if columns are actually stored in separate files, the Designs 1 and 2 would be essentially
equivalentwhen it comes to vacuuming. 
>
> Thanks,
> Craig
>

design 1 is normalized and better
design 2 is denormalized and a bad approach no matter the RDBMS

update does delete + insert, and vacuum is the way to recover the space
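
For example, after a batch of updates on the frequently-updated table (a
minimal sketch; depending on the workload, autovacuum or VACUUM FULL may be
more appropriate):

VACUUM ANALYZE a;  -- reclaims dead row versions left by updates
                   -- and refreshes planner statistics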

--
Atentamente,
Jaime Casanova

"What they (MySQL) lose in usability, they gain back in benchmarks, and that's
all that matters: getting the wrong answer really fast."
                           Randal L. Schwartz

Re: update == delete + insert?

From
"Jim C. Nasby"
Date:
On Mon, Mar 20, 2006 at 08:38:15PM -0500, Jaime Casanova wrote:
> On 3/20/06, Craig A. James <cjames@modgraph-usa.com> wrote:
> > Design 1:
> >    create table a (
> >      id integer,
> >      frequently_updated  integer);
> >
> >    create table b(
> >      id integer,
> >      infrequently_updated_1 integer,
> >      infrequently_updated_2 integer,
> >      infrequently_updated_3 integer,
> >      ... etc.
> >      infrequently_updated_99 integer);
> >
> > Design 2:
> >    create table c(
> >      id integer,
> >      frequently_updated  integer,
> >      infrequently_updated_1 integer,
> >      infrequently_updated_2 integer,
> >      infrequently_updated_3 integer,
> >      ... etc.
> >      infrequently_updated_99 integer);
> design 1 is normalized and better
> design 2 is denormalized and a bad approach no matter the RDBMS

How is design 1 denormalized?

> "What they (MySQL) lose in usability, they gain back in benchmarks, and that's
> all that matters: getting the wrong answer really fast."
>                            Randal L. Schwartz

Where's that quote from?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
"Jim C. Nasby"
Date:
On Mon, Mar 20, 2006 at 04:19:12PM +0100, Csaba Nagy wrote:
> What I can add from our experience: ext3 turned out lousy for our
> application, and converting to XFS made quite a big improvement for our
> DB load. I don't have hard figures, but I think it was some 30%
> improvement in overall speed, and it made a huge difference for heavy
> load times... what I mean is that with ext3 we had multiple parallel big
> tasks taking more time in total than if we had executed them
> sequentially, and with XFS that was gone, load scales linearly. In any
> case you should test the performance of your application on different
> filesystems and different settings, as this could make a huge difference.

Did you try mounting ext3 with data=writeback by chance? People have
found that makes a big difference in performance.

http://archives.postgresql.org/pgsql-performance/2003-01/msg00320.php
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
Csaba Nagy
Date:
> Did you try mounting ext3 with data=writeback by chance? People have
> found that makes a big difference in performance.

I'm not sure, there's other people here doing the OS stuff - I'm pretty
much ignorant about what "data=writeback" could mean :-D

They knew however that for the data partitions no FS journaling is
needed, and for the WAL partition meta data journaling is enough, so I
guess they tuned ext3 for this.

Cheers,
Csaba.



Re: Migration study, step 1: bulk write performance

From
"Steinar H. Gunderson"
Date:
On Tue, Mar 21, 2006 at 12:52:46PM +0100, Csaba Nagy wrote:
> They knew however that for the data partitions no FS journaling is
> needed, and for the WAL partition meta data journaling is enough, so I
> guess they tuned ext3 for this.

For the record, that's the wrong way round. For the data partition,
metadata journaling is enough, and for the WAL partition you don't need any
FS journaling at all.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Migration study, step 1: bulk write performance

From
Csaba Nagy
Date:
> For the record, that's the wrong way round. For the data partition,
> metadata journaling is enough, and for the WAL partition you don't need any
> FS journaling at all.

Yes, you're right: the data partition shouldn't lose file creation,
deletion, etc., which is not important for the WAL partition where the
WAL files are mostly recycled... right?

Cheers,
Csaba.



Re: Migration study, step 1: bulk write performance

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 12:56:18PM +0100, Steinar H. Gunderson wrote:
> On Tue, Mar 21, 2006 at 12:52:46PM +0100, Csaba Nagy wrote:
> > They knew however that for the data partitions no FS journaling is
> > needed, and for the WAL partition meta data journaling is enough, so I
> > guess they tuned ext3 for this.
>
> For the record, that's the wrong way round. For the data partition,
> metadata journaling is enough, and for the WAL partition you don't need any
> FS journaling at all.

Are you sure? Metadata changes are probably a lot more common on the WAL
partition. In any case, I don't see why there should be a difference.
The real issue is: is related filesystem metadata sync'd as part of a
file being fsync'd?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
"Steinar H. Gunderson"
Date:
On Tue, Mar 21, 2006 at 06:01:58AM -0600, Jim C. Nasby wrote:
> Are you sure? Metadata changes are probably a lot more common on the WAL
> partition. In any case, I don't see why there should be a difference.
> The real issue is: is related filesystem metadata sync'd as part of a
> file being fsync'd?

I've been told on this list that PostgreSQL actually takes care to fill a new
WAL file with zeroes etc. when initializing it; dig a few months back and I'm
sure it's there.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Migration study, step 1: bulk write performance

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 01:10:32PM +0100, Steinar H. Gunderson wrote:
> On Tue, Mar 21, 2006 at 06:01:58AM -0600, Jim C. Nasby wrote:
> > Are you sure? Metadata changes are probably a lot more common on the WAL
> > partition. In any case, I don't see why there should be a difference.
> > The real issue is: is related filesystem metadata sync'd as part of a
> > file being fsync'd?
>
> I've been told on this list that PostgreSQL actually takes care to fill a new
> WAL file with zeroes etc. when initializing it; dig a few months back and I'm
> sure it's there.

That's fine and all, but it does no good if the filesystem doesn't know
that the file exists after a crash. The same concern is also true on the
data partition, although it's less likely to be a problem because if you
happen to crash soon after a DDL operation it's likely that you haven't
had a checkpoint yet, so the operation will likely be repeated during
WAL replay. But depending on that is a race condition.

Basically, you need to know for certain that if PostgreSQL creates a
file and then fsync's it that that file is safely on disk, and that the
filesystem knows how to find it (ie: the metadata is also on disk in
some fashion).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
"Steinar H. Gunderson"
Date:
On Tue, Mar 21, 2006 at 06:18:39AM -0600, Jim C. Nasby wrote:
> Basically, you need to know for certain that if PostgreSQL creates a
> file and then fsync's it that that file is safely on disk, and that the
> filesystem knows how to find it (ie: the metadata is also on disk in
> some fashion).

It seems to do, quoting Tom from
http://archives.postgresql.org/pgsql-performance/2005-11/msg00184.php:

== snip ==
  No, Mike is right: for WAL you shouldn't need any journaling.  This is
  because we zero out *and fsync* an entire WAL file before we ever
  consider putting live WAL data in it.  During live use of a WAL file,
  its metadata is not changing.  As long as the filesystem follows
  the minimal rule of syncing metadata about a file when it fsyncs the
  file, all the live WAL files should survive crashes OK.

  We can afford to do this mainly because WAL files can normally be
  recycled instead of created afresh, so the zero-out overhead doesn't
  get paid during normal operation.

  You do need metadata journaling for all non-WAL PG files, since we don't
  fsync them every time we extend them; which means the filesystem could
  lose track of which disk blocks belong to such a file, if it's not
  journaled.
== snip ==

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Migration study, step 1: bulk write performance

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 01:29:54PM +0100, Steinar H. Gunderson wrote:
> On Tue, Mar 21, 2006 at 06:18:39AM -0600, Jim C. Nasby wrote:
> > Basically, you need to know for certain that if PostgreSQL creates a
> > file and then fsync's it that that file is safely on disk, and that the
> > filesystem knows how to find it (ie: the metadata is also on disk in
> > some fashion).
>
> It seems to do, quoting Tom from
> http://archives.postgresql.org/pgsql-performance/2005-11/msg00184.php:

404 :(

>
> == snip ==
>   its metadata is not changing.  As long as the filesystem follows
>   the minimal rule of syncing metadata about a file when it fsyncs the
>   file, all the live WAL files should survive crashes OK.

And therein lies the rub: file metadata *must* commit to disk as part of
an fsync, and it's needed for both WAL and heap data. It's needed for
heap data because as soon as a checkpoint completes, PostgreSQL is free
to erase any WAL info about previous DDL changes.

On FreeBSD, if you're using softupdates, the filesystem will properly
order writes to the drive so that metadata must be written before file
data; this ensures that an fsync on the file will first write any
metadata before writing the data itself.

With fsync turned off, any metadata-changing commands will wait for the
metadata to commit to disk before returning (unless you run async...)

I'm not really sure how this all plays out on a journalling filesystem.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
Michael Stone
Date:
On Tue, Mar 21, 2006 at 06:01:58AM -0600, Jim C. Nasby wrote:
>On Tue, Mar 21, 2006 at 12:56:18PM +0100, Steinar H. Gunderson wrote:
>> For the record, that's the wrong way round. For the data partition,
>> metadata journaling is enough, and for the WAL partition you don't need any
>> FS journaling at all.
>
>Are you sure?

Yes. :) You actually shouldn't need metadata journaling in either
case--fsck will do the same thing. But fsck can take a *very* long time
on a large partition, so for your data partition the journaling fs is a
big win. But your wal partition isn't likely to have very many files
and should fsck in a snap, and data consistency is taken care of by
synchronous operations. (Which is the reason you really don't need/want
data journalling.)

Mike Stone

Re: update == delete + insert?

From
"Merlin Moncure"
Date:
> > design 1 is normalized and better
> > design 2 is denormalized and a bad approach no matter the RDBMS
>
> How is design 1 denormalized?

It isn't :)...he said it is normalized.  Design 2 may or may not be
de-normalized (IMO there is not enough information to make that
determination) but as stated it's a good idea to split the table on
practical grounds.

merlin

Re: update == delete + insert?

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 09:12:08AM -0500, Merlin Moncure wrote:
> > > design 1 is normalized and better
> > > design 2 is denormalized and a bad approach no matter the RDBMS
> >
> > How is design 1 denormalized?
>
> It isn't :)...he said it is normalized.  Design 2 may or may not be
> de-normalized (IMO there is not enough information to make that
> determination) but as stated it's a good idea to split the table on
> practical grounds.

Err, sorry, got the number backwards. My point is that 2 isn't
denormalized afaik, at least not based just on the example. But yes, in
a case like this, vertical partitioning can make a lot of sense.
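
For illustration, a view over the split tables can present the application with
the same wide row as design 2's table c (just a sketch, reusing the table names
from Craig's example):

CREATE VIEW c AS
    SELECT a.id,
           a.frequently_updated,
           b.infrequently_updated_1,
           b.infrequently_updated_2
           -- ... and so on for the remaining columns
      FROM a
      JOIN b USING (id);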
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Migration study, step 1: bulk write performance

From
Simon Riggs
Date:
On Mon, 2006-03-20 at 15:59 +0100, Mikael Carneholm wrote:

> This gives that 10Gb takes ~380s => ~27Mb/s (with fsync=off), compared to the raw dd result (~75.5Mb/s).
>
> I assume this difference is due to:
> - simultaneous WAL write activity (assumed: for each byte written to the table, at least one byte is also written to
WAL,in effect: 10Gb data inserted in the table equals 20Gb written to disk) 
> - lousy test method (it is done using a function => the transaction size is 10Gb, and 10Gb will *not* fit in
wal_buffers:) ) 
> - poor config

> checkpoint_segments = 3

With those settings, you'll be checkpointing every 48 Mb, which will be
about once per second. Since the checkpoint will take a reasonable
amount of time, even with fsync off, you'll be spending most of your
time checkpointing. bgwriter will just be slowing you down too, because
you'll always have more clean buffers than you can use, since you have
132MB of shared_buffers yet are flushing all of them every checkpoint.

Please read your logfile, which should have relevant WARNING messages.

Best Regards, Simon Riggs


Re: Migration study, step 1: bulk write performance

From
Ron
Date:
At 03:44 PM 3/21/2006, Simon Riggs wrote:
>On Mon, 2006-03-20 at 15:59 +0100, Mikael Carneholm wrote:
>
> > This gives that 10Gb takes ~380s => ~27Mb/s (with fsync=off),
> compared to the raw dd result (~75.5Mb/s).
> >
> > I assume this difference is due to:
> > - simultaneous WAL write activity (assumed: for each byte written
> to the table, at least one byte is also written to WAL, in effect:
> 10Gb data inserted in the table equals 20Gb written to disk)
> > - lousy test method (it is done using a function => the
> transaction size is 10Gb, and 10Gb will *not* fit in wal_buffers :) )
> > - poor config
>
> > checkpoint_segments = 3
>
>With those settings, you'll be checkpointing every 48 Mb, which will be
>about once per second. Since the checkpoint will take a reasonable
>amount of time, even with fsync off, you'll be spending most of your
>time checkpointing. bgwriter will just be slowing you down too, because
>you'll always have more clean buffers than you can use, since you have
>132MB of shared_buffers yet are flushing all of them every checkpoint.
IIRC, Josh Berkus did some benches that suggest in pg 8.x a value of
64 - 256 is best for checkpoint_segments as long as you have the RAM available.

I'd suggest trying values of 64, 128, and 256 and setting
checkpoint_segments to the best of those.

Ron



Re: Migration study, step 1: bulk write performance

From
Scott Marlowe
Date:
On Mon, 2006-03-20 at 15:17, Ron wrote:
> At 03:44 PM 3/21/2006, Simon Riggs wrote:
> >On Mon, 2006-03-20 at 15:59 +0100, Mikael Carneholm wrote:
> >
> > > This gives that 10Gb takes ~380s => ~27Mb/s (with fsync=off),
> > compared to the raw dd result (~75.5Mb/s).
> > >
> > > I assume this difference is due to:
> > > - simultaneous WAL write activity (assumed: for each byte written
> > to the table, at least one byte is also written to WAL, in effect:
> > 10Gb data inserted in the table equals 20Gb written to disk)
> > > - lousy test method (it is done using a function => the
> > transaction size is 10Gb, and 10Gb will *not* fit in wal_buffers :) )
> > > - poor config
> >
> > > checkpoint_segments = 3
> >
> >With those settings, you'll be checkpointing every 48 Mb, which will be
> >about once per second. Since the checkpoint will take a reasonable
> >amount of time, even with fsync off, you'll be spending most of your
> >time checkpointing. bgwriter will just be slowing you down too, because
> >you'll always have more clean buffers than you can use, since you have
> >132MB of shared_buffers yet are flushing all of them every checkpoint.
> IIRC, Josh Berkus did some benches that suggest in pg 8.x a value of
> 64 - 256 is best for checkpoint_segments as long as you have the RAM available.
>
> I'd suggest trying values of 64, 128, and 256 and setting
> checkpoint_segments to the best of those.

I've also found that modest increases in commit_siblings and
commit_delay help a lot on certain types of imports.

Re: Migration study, step 1: bulk write performance

From
Alvaro Herrera
Date:
Ron wrote:

> IIRC, Josh Berkus did some benches that suggest in pg 8.x a value of
> 64 - 256 is best for checkpoint_segments as long as you have the RAM
> available.

I think you are confusing checkpoint_segments with wal_buffers.
checkpoint_segments certainly has little to do with available RAM!

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Migration study, step 1: bulk write performance

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> I've also found that modest increases in commit_siblings and
> commit_delay help a lot on certain types of imports.

On a data import?  Those really should have zero effect on a
single-process workload.  Or are you doing multiple concurrent imports?

            regards, tom lane

Re: Migration study, step 1: bulk write performance

From
Scott Marlowe
Date:
On Tue, 2006-03-21 at 15:56, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > I've also found that modest increases in commit_siblings and
> > commit_delay help a lot on certain types of imports.
>
> On a data import?  Those really should have zero effect on a
> single-process workload.  Or are you doing multiple concurrent imports?

That, and it's a machine that's doing other things.  Also, a lot of the
imports are NOT bundled up into groups of transactions, i.e. lots and
lots of individual insert queries.