Thread: CPU bound
Hi all,

I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

Cheers,

--Royce
On 12/13/2010 10:43 AM, Royce Ausburn wrote:
> Hi all,
>
> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

A few thoughts:

- Pg isn't capable of using more than one core for a single task, so if you have one really big job, you'll more easily start struggling on CPU. Restores appear to be a pain point here, though recent work has been done to address that.

- Even with pg_dump/pg_restore's parallel restore, you can't use more than one core to do work for a single COPY or other individual operation. You can only parallelize down to the table level at the moment.

- Pg's design has always focused on rotating media. It can make sense to trade increased CPU costs for reduced I/O when disk storage is slower relative to CPU/RAM. There aren't, AFAIK, many controls beyond the random/seq I/O knobs to get Pg to try to save CPU at the cost of more I/O when opportunities to do so appear.

- Pg's CPU load depends a lot on the data types and table structures in use. What are your tables like? Do they have indexes added at the end, or are they created with indexes and then populated with rows? The former is MUCH faster. Are they full of NUMERIC fields? Those seem to be incredibly slow compared to int/float/etc., which is hardly surprising given their storage and how they work.

--
Craig Ringer
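A minimal sketch of the load-then-index point above; the database, table, and file names are made up for the example:

    # create the table bare, bulk-load it, and only then build the index --
    # this is usually much faster than loading rows into an already-indexed table
    psql -d mydb -c "CREATE TABLE measurements (id integer, taken timestamptz, value float8)"
    psql -d mydb -c "\copy measurements FROM 'measurements.csv' WITH CSV"
    psql -d mydb -c "CREATE INDEX measurements_taken_idx ON measurements (taken)"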
On 12/12/10 6:43 PM, Royce Ausburn wrote:
> Hi all,
>
> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

That's pretty normal; as soon as you get decent disk, especially something like an SSD with a RAM cache, you become CPU-bound. COPY does a LOT of parsing and data manipulation. Index building, of course, is almost pure CPU if you have a decent amount of RAM available.

If you're restoring from a pg_dump file, and have several cores available, I suggest using parallel pg_restore.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
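For anyone following along, parallel restore needs a custom-format dump; a minimal sketch, with placeholder database names and a job count you would normally match to your core count:

    pg_dump -Fc -f mydb.dump mydb                  # custom-format dump, required for parallel restore
    pg_restore -j 4 -d mydb_restored mydb.dump     # -j runs several restore jobs in parallel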
Thanks guys - interesting.

On 14/12/2010, at 5:59 AM, Josh Berkus wrote:
> On 12/12/10 6:43 PM, Royce Ausburn wrote:
>> Hi all,
>>
>> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.
>
> That's pretty normal; as soon as you get decent disk, especially
> something like an SSD with a RAM cache, you become CPU-bound. COPY does
> a LOT of parsing and data manipulation. Index building, of course, is
> almost pure CPU if you have a decent amount of RAM available.
>
> If you're restoring from a pg_dump file, and have several cores
> available, I suggest using parallel pg_restore.
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
>>>>> "RA" == Royce Ausburn <royce@inomial.com> writes: RA> I notice that when restoring a DB on a laptop with an SDD, RA> typically postgres is maxing out a CPU - even during a COPY. The time the CPUs spend waiting on system RAM shows up as CPU time, not as Wait time. It could be just that the SSD is fast enough that the RAM is now the bottleneck, although parsing and text<=>binary conversions (especially for integers, reals and anything stored as an integer) also can be CPU-intensive. -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
On 12/19/2010 7:57 PM, James Cloos wrote:
>>>>>> "RA" == Royce Ausburn<royce@inomial.com> writes:
> RA> I notice that when restoring a DB on a laptop with an SSD,
> RA> typically postgres is maxing out a CPU - even during a COPY.
>
> The time the CPUs spend waiting on system RAM shows up as CPU
> time, not as Wait time. It could be just that the SSD is fast
> enough that the RAM is now the bottleneck, although parsing
> and text<=>binary conversions (especially for integers, reals
> and anything stored as an integer) also can be CPU-intensive.
>
> -JimC

Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
>>>>> "MG" == Mladen Gogala <mladen.gogala@vmsinfo.com> writes: MG> Good time accounting is the most compelling reason for having a wait MG> event interface, like Oracle. Without the wait event interface, one MG> cannot really tell where the time is spent, at least not without MG> profiling the database code, which is not an option for a production MG> database. And how exactly, given that the kernel does not know whether the CPU is active or waiting on ram, could an application do so? -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
On Mon, Dec 20, 2010 at 10:33:26AM -0500, James Cloos wrote:
> >>>>> "MG" == Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
>
> MG> Good time accounting is the most compelling reason for having a wait
> MG> event interface, like Oracle. Without the wait event interface, one
> MG> cannot really tell where the time is spent, at least not without
> MG> profiling the database code, which is not an option for a production
> MG> database.
>
> And how exactly, given that the kernel does not know whether the CPU is
> active or waiting on ram, could an application do so?
>

Exactly. I have only seen this data from hardware emulators. It would be nice to have... :)

Ken
On 2010-12-20 15:48, Kenneth Marshall wrote:
>> And how exactly, given that the kernel does not know whether the CPU is
>> active or waiting on ram, could an application do so?
>>
>
> Exactly. I have only seen this data from hardware emulators. It would
> be nice to have... :)

There's no reason that the CPU hardware couldn't gather such data, and IMHO it'd be dead useful, at least at the outermost cache level (preferably separately at each level). But people have trouble understanding vmstat already....

Note that dtrace *can* get to the CPU performance counters; it's just that the kernel doesn't routinely account for all that info per-process. I'd expect IBM to have equivalent facilities.

--
Jeremy
On 12/20/2010 10:33 AM, James Cloos wrote:
>
> And how exactly, given that the kernel does not know whether the CPU is
> active or waiting on ram, could an application do so?
>
> -JimC

That particular aspect will remain hidden; it's the domain of the hardware architecture. Nevertheless, there are things like waiting on I/O or waiting on a lock that would be extremely useful.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
On Dec 20, 2010, at 12:47 AM, Mladen Gogala wrote:
> Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.

Out of curiosity, have you tried using the information that Postgres exposes to dtrace? I suspect it comes close to what you can get directly out of Oracle...

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
Jim Nasby wrote:
> On Dec 20, 2010, at 12:47 AM, Mladen Gogala wrote:
>
>> Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.
>>
>
> Out of curiosity, have you tried using the information that Postgres exposes to dtrace? I suspect it comes close to what you can get directly out of Oracle...
> --
> Jim C. Nasby, Database Architect                   jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net

No, I haven't, but I looked it up in the documentation. I surmise, however, that I will have to build the software with "--enable-dtrace", which is not enabled by default. This certainly looks promising.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
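A rough sketch of what that looks like in practice -- the PID below is a placeholder, and the set of probes exposed varies by PostgreSQL version, so check the documentation for the build you're running:

    ./configure --enable-dtrace ...               # rebuild with the probe points compiled in
    # list the probes a running backend exposes (replace 12345 with a backend PID)
    dtrace -l -n 'postgresql$target:::' -p 12345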
Has anyone had a chance to recompile and try a larger blocksize than 8192 with PostgreSQL 8.4.x? I'm finally getting around to tuning some FusionIO drives that we are setting up. We are looking to set up 4 FusionIO drives per server, and then use pgpooler to scale them to 3 servers so that we can scale up to 72 processors. I'm almost done with the configuration to start heavily testing, but wanted to know if anyone has really messed with the blocksize options.

If anyone has done load balancing with pgpooler I'd love to hear their experience with it as well.

I have attached the randwrite performance test below; as you can see, going from 8K -> 32K -> 1M blocksize the drives really start to move.

Thanks,
- John

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=8k --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/138M /s] [0/17K iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=23287
  write: io=3,819MB, bw=127MB/s, iops=16,292, runt= 30001msec
    clat (usec): min=41, max=1,835K, avg=268.42, stdev=3627.29
    bw (KB/s) : min=    4, max=142304, per=15.13%, avg=19714.31, stdev=8364.40
  cpu       : usr=0.16%, sys=3.13%, ctx=1123544, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/488779, short=0/0
     lat (usec): 50=14.81%, 100=58.17%, 250=0.28%, 500=22.33%, 750=3.67%
     lat (usec): 1000=0.16%
     lat (msec): 2=0.35%, 4=0.09%, 10=0.11%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2000=0.01%

Run status group 0 (all jobs):
  WRITE: io=3,819MB, aggrb=127MB/s, minb=130MB/s, maxb=130MB/s, mint=30001msec, maxt=30001msec

Disk stats (read/write):
  md0: ios=0/514993, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=32k --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/343M /s] [0/11K iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=23835
  write: io=9,833MB, bw=328MB/s, iops=10,487, runt= 30002msec
    clat (usec): min=64, max=227K, avg=349.31, stdev=1517.64
    bw (KB/s) : min=  883, max=171712, per=16.25%, avg=54548.49, stdev=13973.76
  cpu       : usr=0.18%, sys=2.82%, ctx=669138, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/314659, short=0/0
     lat (usec): 100=84.14%, 250=8.73%, 500=0.07%, 750=2.45%, 1000=3.19%
     lat (msec): 2=0.29%, 4=0.17%, 10=0.22%, 20=0.23%, 50=0.31%
     lat (msec): 100=0.13%, 250=0.05%

Run status group 0 (all jobs):
  WRITE: io=9,833MB, aggrb=328MB/s, minb=336MB/s, maxb=336MB/s, mint=30002msec, maxt=30002msec

Disk stats (read/write):
  md0: ios=0/455522, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=1M --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/912M /s] [0/890 iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=24877
  write: io=25,860MB, bw=862MB/s, iops=861, runt= 30004msec
    clat (usec): min=456, max=83,766, avg=5599.02, stdev=2026.74
    bw (KB/s) : min=28603, max=216966, per=11.93%, avg=105311.22, stdev=10668.06
  cpu       : usr=0.06%, sys=2.03%, ctx=91888, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/25860, short=0/0
     lat (usec): 500=12.74%, 750=20.60%, 1000=7.37%
     lat (msec): 2=3.12%, 4=6.57%, 10=26.95%, 20=21.37%, 50=1.12%
     lat (msec): 100=0.14%

Run status group 0 (all jobs):
  WRITE: io=25,860MB, aggrb=862MB/s, minb=883MB/s, maxb=883MB/s, mint=30004msec, maxt=30004msec

Disk stats (read/write):
  md0: ios=0/500382, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
Strange, John W wrote:
> Has anyone had a chance to recompile and try a larger blocksize than 8192 with PostgreSQL 8.4.x?

While I haven't done the actual experiment you're asking about, the problem working against you here is how WAL data is used to protect against partial database writes. See the documentation for full_page_writes at http://www.postgresql.org/docs/current/static/runtime-config-wal.html

Because full-size copies of the blocks have to get written there, attempts to chunk writes into larger pieces end up requiring a correspondingly larger volume of writes to protect against partial writes to those pages. You might get a nice efficiency gain on the read side, but the situation when under a heavy write load (the main thing you have to be careful about with these SSDs) is much less clear.

I wouldn't draw any conclusions whatsoever from what fio says about this; it's pretty useless IMHO for simulating anything like a real database workload. I don't even use that utility anymore, as I found it just wasted my time compared with moving straight onto something that tries to act like a database application simulation. You might instead try running pgbench with the database scale set large enough that the resulting database is large relative to total system RAM.

P.S. Make sure you put the FusionIO drives under a heavy write load and power down the server hard, so you can see what happens if there's a real-world crash. Recovery time to remount the drives in that situation is the main drawback of their design. It does the right thing to protect your data as far as I know, but the recovery can be quite time intensive.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
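For reference, the block size is a compile-time option and requires initdb'ing a fresh cluster; a sketch of the kind of test Greg describes, where the scale factor and client counts are only examples -- pick a scale that makes the database comfortably larger than RAM:

    ./configure --with-blocksize=32 ...    # block size in kB; rebuild, then initdb a new cluster
    pgbench -i -s 4000 mydb                # scale 4000 is roughly 60GB of data
    pgbench -c 16 -j 4 -T 600 mydb         # sustained TPC-B-like load for 10 minutes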
On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Strange, John W wrote:
>>
>> Has anyone had a chance to recompile and try a larger blocksize
>> than 8192 with PostgreSQL 8.4.x?
>
> While I haven't done the actual experiment you're asking about, the problem
> working against you here is how WAL data is used to protect against partial
> database writes. See the documentation for full_page_writes at
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> Because full size copies of the blocks have to get written there, attempts
> to chunk writes into larger pieces end up requiring a correspondingly larger
> volume of writes to protect against partial writes to those pages. You
> might get a nice efficiency gain on the read side, but the situation when
> under a heavy write load (the main thing you have to be careful about with
> these SSDs) is much less clear.

most flash drives, especially mlc flash, use huge blocks anyways on a physical level. the numbers claimed here (http://www.fusionio.com/products/iodrive/) (141k write iops) are simply not believable without write buffering. i didn't see any note of how fault tolerance is maintained through the buffer (anyone know?).

assuming they do buffer, i would expect a smaller blocksize to be better/easier on the ssd, since this would mean less gross writing, higher maximum throughput, and less wear and tear on the flash; the advantages of the larger blocksize are very hardware driven and already managed by the controller. if they don't buffer (again, i'm very skeptical this is the case), a larger block size, possibly even a much larger block size (like 256k), would be an interesting test.

i'm pretty skeptical about the fusion i/o product generally, because i don't think the sata interface is a bottleneck save for read caching, and the o/s is already buffering reads. the storage medium is still the bottleneck for the most part (even if it's much faster at certain things). note fusion is still useful for some things, but the niche is narrower than it looks on the surface.

merlin
On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

FusionIO buffers. They have capacitors onboard to protect against crashing and power failure. They passed our crash attempts to corrupt writes to them before we put them into production, for whatever that's worth, but they do take a long time to come back online after an unclean shutdown.
This has gotten a lot better with the 2.x drivers as well.

I'm completely aware of the FusionIO and its advantages/disadvantages. I'm currently getting the following pgbench results but still only hitting the array for about 800MB/sec, short of the 3GB/sec that it's capable of. This is simply a trash DB for us to store results in for short periods of time. If something bad was to happen we can regenerate the results. So performance with limited risk is what we are looking to achieve.

asgprod@ASH01_riskresults $ pgbench -v -j 4 -t 200000 -c 16 -h localhost -p 4410 pgbench_10000
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 10000
query mode: simple
number of clients: 16
number of threads: 4
number of transactions per client: 200000
number of transactions actually processed: 3200000/3200000
tps = 16783.841042 (including connections establishing)
tps = 16785.592722 (excluding connections establishing)

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Ben Chobot
Sent: Tuesday, January 04, 2011 12:37 PM
To: Merlin Moncure
Cc: pgsql-performance@postgresql.org Performance
Subject: Re: [PERFORM] Question: BlockSize > 8192 with FusionIO

On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

FusionIO buffers. They have capacitors onboard to protect against crashing and power failure. They passed our crash attempts to corrupt writes to them before we put them into production, for whatever that's worth, but they do take a long time to come back online after an unclean shutdown.
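One quick way to tell whether a run like that is disk-limited or CPU-limited is to watch the devices and the CPUs at the same time while pgbench runs; a sketch, assuming the sysstat tools are installed (watch the line for whatever device the FusionIO array shows up as, e.g. md0 here):

    iostat -xm 1      # per-device MB/s written and utilisation, sampled every second
    vmstat 1          # user/system CPU time vs. idle and iowait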
Test - sorry, trying to fix why my email is getting formatted to bits when posting to the list.

- John
On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
> On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Strange, John W wrote:
>>>
>>> Has anyone had a chance to recompile and try a larger blocksize
>>> than 8192 with PostgreSQL 8.4.x?
>>
>> While I haven't done the actual experiment you're asking about, the problem
>> working against you here is how WAL data is used to protect against partial
>> database writes. See the documentation for full_page_writes at
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>> Because full size copies of the blocks have to get written there, attempts
>> to chunk writes into larger pieces end up requiring a correspondingly larger
>> volume of writes to protect against partial writes to those pages. You
>> might get a nice efficiency gain on the read side, but the situation when
>> under a heavy write load (the main thing you have to be careful about with
>> these SSDs) is much less clear.
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

Flash may have very large erase blocks -- 4k to 16M -- but you can write to it at much smaller block sizes sequentially. It has to erase a block in bulk, but it can write to an erased block bit by bit, sequentially (512 or 4096 bytes typically, but some is 8k and 16k).

Older MLC NAND flash could be written a couple of bytes at a time -- but drives today incorporate too much ECC and use larger chunks to do that. The minimum write size now is caused by the ECC requirements and not the physical NAND flash requirements.

So, buffering isn't that big of a requirement with the current LBA -> physical translations, which change all writes -- random or not -- to sequential writes in one erase block.

But performance when waiting for the write to complete will not be all that good, especially with MLC. Turn off the buffer on an Intel SLC drive, for example, and write IOPS is cut by 1/3 or more -- to 'only' 1000 or so IOPS.