Thread: concurrent COPY performance
Hi!

I have been doing some bulk loading testing recently - mostly with a focus
on answering why we are "only" getting a (max of) cores/2 speedup using
parallel restore (up to around 8 cores; even less with more).

What I found is that on a fast IO subsystem we are CPU bottlenecked on
concurrent COPY when it is able to use the WAL bypass (and it scales up to
around cores/2), and performance without the WAL bypass is very bad. In the
WAL-logged case we only get a 50% speedup from adding the second process
already, we are never able to scale better than 3x (up to 8 cores), and
performance degrades even after that point.

The profile for that case (loading the lineitem table from the DBT3
benchmark) looks fairly similar to what I have seen in the past for
IO-intensive concurrent workloads:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000

samples  %        symbol name
47546    15.4562  XLogInsert
39598    12.8725  DoCopy
33798    10.9870  CopyReadLine
14191     4.6132  DecodeNumber
12986     4.2215  heap_fill_tuple
12092     3.9308  pg_verify_mbstr_len
 9553     3.1055  DecodeDate
 9289     3.0197  InputFunctionCall
 7972     2.5915  ParseDateTime
 7324     2.3809  DecodeDateTime
 7290     2.3698  pg_next_dst_boundary
 7218     2.3464  heap_form_tuple
 5385     1.7505  AllocSetAlloc
 4779     1.5536  heap_compute_data_size
 4367     1.4196  float4in
 3903     1.2688  DetermineTimeZoneOffset
 3603     1.1713  pg_mblen
 3494     1.1358  pg_atoi
 3461     1.1251  .plt
 3428     1.1144  date2j
 3416     1.1105  pg_mbstrlen_with_len

This is for 8 connections on an 8-core/16-thread box. At higher connection
counts the server is >70% idle and shows even worse throughput.

In the WAL bypass case I can actually get a performance improvement up to
16 parallel connections, however here too I only get cores/2 maximum
throughput. We are able to max out the CPU on the server in that case
though (no idle time, and iowait only in the single-digit range). Profiling
that workload (loading around 1400000 rows/s) I get:

samples  %        symbol name
2462358  15.3353  DoCopy
2319677  14.4467  CopyReadLine
 806269   5.0214  pg_verify_mbstr_len
 782594   4.8739  DecodeNumber
 712426   4.4369  heap_fill_tuple
 642893   4.0039  DecodeDate
 609313   3.7947  InputFunctionCall
 476887   2.9700  ParseDateTime
 456569   2.8435  pg_next_dst_boundary
 435812   2.7142  DecodeDateTime
 429733   2.6763  heap_form_tuple
 361750   2.2529  heap_compute_data_size
 301193   1.8758  AllocSetAlloc
 268830   1.6742  float4in
 238119   1.4830  DetermineTimeZoneOffset
 229598   1.4299  pg_atoi
 225169   1.4023  .plt
 217817   1.3565  pg_mbstrlen_with_len
 207041   1.2894  PageAddItem
 200024   1.2457  pg_mblen
 192237   1.1972  bpchar_input
 181552   1.1307  date2j

For those interested, I have some additional information up at:

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

Stefan
On Tue, Jun 16, 2009 at 12:47 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> Hi!
>
> I have been doing some bulk loading testing recently - mostly with a focus
> on answering why we are "only" getting a (max of) cores/2 speedup using
> parallel restore (up to around 8 cores; even less with more).
> What I found is that on a fast IO subsystem we are CPU bottlenecked on
> concurrent COPY when it is able to use the WAL bypass (and it scales up to
> around cores/2), and performance without the WAL bypass is very bad.
> In the WAL-logged case we only get a 50% speedup from adding the second
> process already, we are never able to scale better than 3x (up to 8
> cores), and performance degrades even after that point.

How are you bypassing WAL? Do I read this properly that on your 8-core
system you are getting a 4x speedup with WAL bypass and a 3x speedup
without?

merlin
Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>
>> I have been doing some bulk loading testing recently - mostly with a focus
>> on answering why we are "only" getting a (max of) cores/2 speedup using
>> parallel restore (up to around 8 cores; even less with more).
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it is able to use the WAL bypass (and it scales up to
>> around cores/2), and performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process already, we are never able to scale better than 3x (up to 8
>> cores), and performance degrades even after that point.
>
> How are you bypassing WAL? Do I read this properly that on your 8-core
> system you are getting a 4x speedup with WAL bypass and a 3x speedup
> without?

If a table is created or truncated in the same transaction that does the
load, and archiving is not on, the COPY is not WAL-logged. That is why
parallel restore wraps the COPY in a transaction and precedes it with a
TRUNCATE if it created the table.
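Roughly, the pattern looks like this (a sketch only - the table and file
names are made-up placeholders, and it assumes archiving is off):

    -- The COPY can skip WAL because the table was created in the
    -- same transaction:
    BEGIN;
    CREATE TABLE load_target (id integer, payload text);
    COPY load_target FROM '/path/to/data.csv' WITH CSV;
    COMMIT;

    -- For a pre-existing table, parallel restore gets the same
    -- effect with a TRUNCATE in the same transaction:
    BEGIN;
    TRUNCATE load_target;
    COPY load_target FROM '/path/to/data.csv' WITH CSV;
    COMMIT;

cheers

andrew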
Andrew Dunstan <andrew@dunslane.net> wrote:
> If a table is created or truncated in the same transaction that does
> the load, and archiving is not on, the COPY is not WALed.

Slightly off topic, but possibly relevant to the overall process: those
are the same conditions under which I would love to see the rows inserted
with the hint bits showing successful commit and the transaction ID
showing frozen. We currently do a VACUUM FREEZE ANALYZE after such a
load, to avoid burdening random users with the writes. It would be nice
not to have to write all the pages again right after a load.
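For reference, the extra pass we run today is just this (table name is a
placeholder):

    -- Rewrites the freshly loaded pages a second time, setting hint
    -- bits and freezing tuples so later readers don't have to:
    VACUUM FREEZE ANALYZE load_target;

-Kevin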
Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
>> I have been doing some bulk loading testing recently - mostly with a focus
>> on answering why we are "only" getting a (max of) cores/2 speedup using
>> parallel restore (up to around 8 cores; even less with more).
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it is able to use the WAL bypass (and it scales up to
>> around cores/2), and performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process already, we are never able to scale better than 3x (up to 8
>> cores), and performance degrades even after that point.
>
> How are you bypassing WAL? Do I read this properly that on your 8-core
> system you are getting a 4x speedup with WAL bypass and a 3x speedup
> without?

The test is simply executing something like

    psql -c "BEGIN;TRUNCATE lineitem1;COPY lineitem1 FROM ....;COMMIT;"

in parallel, with the source file hosted on a separate array and primed
into the OS buffer cache.

The box actually has 8 cores/16 threads - I get a 3x speedup going up to
8 processes without the WAL bypass, but at higher connection counts
performance degrades. Using the WAL bypass I get near-perfect scalability
up to 4 connections and a maximum speedup of ~8x using 16 connections
(ie. all threads).
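Spelled out, the bypass command above corresponds to the first block
below; the WAL-logged runs correspond to anything that breaks the bypass
condition (a sketch - the file path stands in for the one elided above,
and exactly how the WAL-logged runs were driven is not shown here):

    -- WAL-bypass case: TRUNCATE and COPY in one transaction, with
    -- archiving off, so the COPY skips WAL entirely:
    BEGIN;
    TRUNCATE lineitem1;
    COPY lineitem1 FROM '/path/to/lineitem.tbl';
    COMMIT;

    -- WAL-logged case: e.g. truncate in a separate, earlier
    -- transaction, so the COPY goes through XLogInsert (the top
    -- symbol in the first profile):
    TRUNCATE lineitem1;
    BEGIN;
    COPY lineitem1 FROM '/path/to/lineitem.tbl';
    COMMIT;

Stefan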
On Wed, Jun 17, 2009 at 3:44 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Andrew Dunstan <andrew@dunslane.net> wrote:
>> If a table is created or truncated in the same transaction that does
>> the load, and archiving is not on, the COPY is not WALed.
>
> Slightly off topic, but possibly relevant to the overall process:
> those are the same conditions under which I would love to see the
> rows inserted with the hint bits showing successful commit and the
> transaction ID showing frozen. We currently do a VACUUM FREEZE
> ANALYZE after such a load, to avoid burdening random users with the
> writes. It would be nice not to have to write all the pages again
> right after a load.

+1 (if it is doable, that is).

Best regards,
--
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
Mail sent from my BlackLaptop device