Thread: concurrent COPY performance
Hi!

I have been doing some bulk loading testing recently - mostly with a focus
on answering why we are "only" getting a (max of) cores/2 speedup using
parallel restore (up to around 8 cores; even less with more).

What I found is that on a fast IO subsystem we are CPU bottlenecked on
concurrent COPY when it is able to use the WAL bypass (and it scales up to
around cores/2), and performance without the WAL bypass is very bad. In the
WAL-logged case we only get a 50% speedup from adding the second process
already, we are never able to scale better than 3x (up to 8 cores), and
performance degrades even after that point.

The profile for that case (loading the lineitem table from the DBT3
benchmark) looks fairly similar to what I have seen in the past for
IO-intensive concurrent workloads:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000

samples  %        symbol name
47546    15.4562  XLogInsert
39598    12.8725  DoCopy
33798    10.9870  CopyReadLine
14191     4.6132  DecodeNumber
12986     4.2215  heap_fill_tuple
12092     3.9308  pg_verify_mbstr_len
 9553     3.1055  DecodeDate
 9289     3.0197  InputFunctionCall
 7972     2.5915  ParseDateTime
 7324     2.3809  DecodeDateTime
 7290     2.3698  pg_next_dst_boundary
 7218     2.3464  heap_form_tuple
 5385     1.7505  AllocSetAlloc
 4779     1.5536  heap_compute_data_size
 4367     1.4196  float4in
 3903     1.2688  DetermineTimeZoneOffset
 3603     1.1713  pg_mblen
 3494     1.1358  pg_atoi
 3461     1.1251  .plt
 3428     1.1144  date2j
 3416     1.1105  pg_mbstrlen_with_len

This is for 8 connections on an 8-core/16-thread box. At higher connection
counts the server is >70% idle and shows even worse throughput.

In the WAL bypass case I can actually get a performance improvement up to
16 parallel connections, however here too I only get cores/2 maximum
throughput. We are able to max out the CPU on the server in that case
though (no idle time, and iowait only in the single-digit range). Profiling
that workload (loading around 1400000 rows/s) I get:

samples  %        symbol name
2462358  15.3353  DoCopy
2319677  14.4467  CopyReadLine
 806269   5.0214  pg_verify_mbstr_len
 782594   4.8739  DecodeNumber
 712426   4.4369  heap_fill_tuple
 642893   4.0039  DecodeDate
 609313   3.7947  InputFunctionCall
 476887   2.9700  ParseDateTime
 456569   2.8435  pg_next_dst_boundary
 435812   2.7142  DecodeDateTime
 429733   2.6763  heap_form_tuple
 361750   2.2529  heap_compute_data_size
 301193   1.8758  AllocSetAlloc
 268830   1.6742  float4in
 238119   1.4830  DetermineTimeZoneOffset
 229598   1.4299  pg_atoi
 225169   1.4023  .plt
 217817   1.3565  pg_mbstrlen_with_len
 207041   1.2894  PageAddItem
 200024   1.2457  pg_mblen
 192237   1.1972  bpchar_input
 181552   1.1307  date2j

For those interested, I have some additional information up at:

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

Stefan
On Tue, Jun 16, 2009 at 12:47 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> Hi!
>
> I have been doing some bulk loading testing recently - mostly with a focus
> on answering why we are "only" getting a (max of) cores/2 speedup using
> parallel restore (up to around 8 cores; even less with more).
> What I found is that on a fast IO subsystem we are CPU bottlenecked on
> concurrent COPY when it is able to use the WAL bypass (and it scales up to
> around cores/2), and performance without the WAL bypass is very bad.
> In the WAL-logged case we only get a 50% speedup from adding the second
> process already, we are never able to scale better than 3x (up to 8
> cores), and performance degrades even after that point.

How are you bypassing WAL? Do I read this properly that on your 8-core
system you are getting a 4x speedup with WAL bypass and a 3x speedup
without?

merlin
Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>
>> I have been doing some bulk loading testing recently - mostly with a focus
>> on answering why we are "only" getting a (max of) cores/2 speedup using
>> parallel restore (up to around 8 cores; even less with more).
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it is able to use the WAL bypass (and it scales up to
>> around cores/2), and performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process already, we are never able to scale better than 3x (up to 8
>> cores), and performance degrades even after that point.
>
> How are you bypassing WAL? Do I read this properly that on your 8-core
> system you are getting a 4x speedup with WAL bypass and a 3x speedup
> without?

If a table is created or truncated in the same transaction that does the
load, and archiving is not on, the COPY is not WAL-logged. That is why
parallel restore wraps the COPY in a transaction and precedes it with a
TRUNCATE if it created the table.
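Roughly, the pattern looks like this (a sketch only - the table and file
names are made-up placeholders, and it assumes archiving is off):

    -- The COPY can skip WAL because the table was created in the
    -- same transaction:
    BEGIN;
    CREATE TABLE load_target (id integer, payload text);
    COPY load_target FROM '/path/to/data.csv' WITH CSV;
    COMMIT;

    -- For a pre-existing table, parallel restore gets the same
    -- effect with a TRUNCATE in the same transaction:
    BEGIN;
    TRUNCATE load_target;
    COPY load_target FROM '/path/to/data.csv' WITH CSV;
    COMMIT;

cheers

andrew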
Andrew Dunstan <andrew@dunslane.net> wrote:
> If a table is created or truncated in the same transaction that does
> the load, and archiving is not on, the COPY is not WALed.

Slightly off topic, but possibly relevant to the overall process: those
are the same conditions under which I would love to see the rows inserted
with the hint bits showing successful commit and the transaction ID
showing frozen. We currently do a VACUUM FREEZE ANALYZE after such a
load, to avoid burdening random users with the writes. It would be nice
not to have to write all the pages again right after a load.
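For reference, the extra pass we run today is just this (table name is a
placeholder):

    -- Rewrites the freshly loaded pages a second time, setting hint
    -- bits and freezing tuples so later readers don't have to:
    VACUUM FREEZE ANALYZE load_target;

-Kevin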
Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>> I have been doing some bulk loading testing recently - mostly with a focus
>> on answering why we are "only" getting a (max of) cores/2 speedup using
>> parallel restore (up to around 8 cores; even less with more).
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it is able to use the WAL bypass (and it scales up to
>> around cores/2), and performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process already, we are never able to scale better than 3x (up to 8
>> cores), and performance degrades even after that point.
>
> How are you bypassing WAL? Do I read this properly that on your 8-core
> system you are getting a 4x speedup with WAL bypass and a 3x speedup
> without?

The test is simply executing something like

    psql -c "BEGIN;TRUNCATE lineitem1;COPY lineitem1 FROM ....;COMMIT;"

in parallel, with the source file hosted on a separate array and primed
into the OS buffer cache.

The box actually has 8 cores/16 threads - I get a 3x speedup going up to
8 processes without the WAL bypass, but at higher connection counts
performance degrades. Using the WAL bypass I get near-perfect scalability
up to 4 connections and a maximum speedup of ~8x using 16 connections
(ie. all threads).
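Spelled out, the bypass command above corresponds to the first block
below; the WAL-logged runs correspond to anything that breaks the bypass
condition (a sketch - the file path stands in for the one elided above,
and exactly how the WAL-logged runs were driven is not shown here):

    -- WAL-bypass case: TRUNCATE and COPY in one transaction, with
    -- archiving off, so the COPY skips WAL entirely:
    BEGIN;
    TRUNCATE lineitem1;
    COPY lineitem1 FROM '/path/to/lineitem.tbl';
    COMMIT;

    -- WAL-logged case: e.g. truncate in a separate, earlier
    -- transaction, so the COPY goes through XLogInsert (the top
    -- symbol in the first profile):
    TRUNCATE lineitem1;
    BEGIN;
    COPY lineitem1 FROM '/path/to/lineitem.tbl';
    COMMIT;

Stefan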
On Wed, Jun 17, 2009 at 3:44 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Andrew Dunstan <andrew@dunslane.net> wrote:
>> If a table is created or truncated in the same transaction that does
>> the load, and archiving is not on, the COPY is not WALed.
>
> Slightly off topic, but possibly relevant to the overall process:
> those are the same conditions under which I would love to see the
> rows inserted with the hint bits showing successful commit and the
> transaction ID showing frozen. We currently do a VACUUM FREEZE
> ANALYZE after such a load, to avoid burdening random users with the
> writes. It would be nice not to have to write all the pages again
> right after a load.

+1 (if it is doable, that is).

Best regards,
--
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
Mail sent from my BlackLaptop device