Thread: CPU bound
Hi all,

I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

Cheers,

--Royce
On 12/13/2010 10:43 AM, Royce Ausburn wrote:
> Hi all,
>
> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

A few thoughts:

- Pg isn't capable of using more than one core for a single task, so if you have one really big job, you'll more easily start struggling on CPU. Restores appear to be a pain point here, though recent work has been done to address that.

- Even with pg_dump/pg_restore's parallel restore, you can't use more than one core to do work for a single COPY or other individual operation. You can only parallelize down to the table level at the moment.

- Pg's design has always focused on rotating media. It can make sense to trade increased CPU costs for reduced I/O when disk storage is slower relative to CPU/RAM. There aren't, AFAIK, many controls beyond the random/seq I/O knobs to get Pg to try to save CPU at the cost of more I/O when opportunities to do so appear.

- Pg's CPU load depends a lot on the data types and table structures in use. What are your tables like? Do they have indexes added at the end, or are they created with indexes and then populated with rows? The former is MUCH faster. Are they full of NUMERIC fields? Those seem to be incredibly slow compared to int/float/etc., which is hardly surprising given their storage and how they work.

--
Craig Ringer
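A minimal sketch of the load-then-index point above; the database, table, and file names are made up for the example:

    # create the table bare, bulk-load it, and only then build the index --
    # this is usually much faster than loading rows into an already-indexed table
    psql -d mydb -c "CREATE TABLE measurements (id integer, taken timestamptz, value float8)"
    psql -d mydb -c "\copy measurements FROM 'measurements.csv' WITH CSV"
    psql -d mydb -c "CREATE INDEX measurements_taken_idx ON measurements (taken)"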
On 12/12/10 6:43 PM, Royce Ausburn wrote:
> Hi all,
>
> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.

That's pretty normal; as soon as you get decent disk, especially something like an SSD with a RAM cache, you become CPU-bound. COPY does a LOT of parsing and data manipulation. Index building, of course, is almost pure CPU if you have a decent amount of RAM available.

If you're restoring from a pg_dump file, and have several cores available, I suggest using parallel pg_restore.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
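For anyone following along, parallel restore needs a custom-format dump; a minimal sketch, with placeholder database names and a job count you would normally match to your core count:

    pg_dump -Fc -f mydb.dump mydb                  # custom-format dump, required for parallel restore
    pg_restore -j 4 -d mydb_restored mydb.dump     # -j runs several restore jobs in parallel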
Thanks guys - interesting.

On 14/12/2010, at 5:59 AM, Josh Berkus wrote:
> On 12/12/10 6:43 PM, Royce Ausburn wrote:
>> Hi all,
>>
>> I notice that when restoring a DB on a laptop with an SSD, typically postgres is maxing out a CPU - even during a COPY. I wonder, what is postgres usually doing with the CPU? I would have thought the disk would usually be the bottleneck in the DB, but occasionally it's not. We're embarking on a new DB server project and it'd be helpful to understand where the CPU is likely to be the bottleneck.
>
> That's pretty normal; as soon as you get decent disk, especially
> something like an SSD with a RAM cache, you become CPU-bound. COPY does
> a LOT of parsing and data manipulation. Index building, of course, is
> almost pure CPU if you have a decent amount of RAM available.
>
> If you're restoring from a pg_dump file, and have several cores
> available, I suggest using parallel pg_restore.
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
>>>>> "RA" == Royce Ausburn <royce@inomial.com> writes: RA> I notice that when restoring a DB on a laptop with an SDD, RA> typically postgres is maxing out a CPU - even during a COPY. The time the CPUs spend waiting on system RAM shows up as CPU time, not as Wait time. It could be just that the SSD is fast enough that the RAM is now the bottleneck, although parsing and text<=>binary conversions (especially for integers, reals and anything stored as an integer) also can be CPU-intensive. -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
On 12/19/2010 7:57 PM, James Cloos wrote:
>>>>>> "RA" == Royce Ausburn<royce@inomial.com> writes:
> RA> I notice that when restoring a DB on a laptop with an SSD,
> RA> typically postgres is maxing out a CPU - even during a COPY.
>
> The time the CPUs spend waiting on system RAM shows up as CPU
> time, not as Wait time. It could be just that the SSD is fast
> enough that the RAM is now the bottleneck, although parsing
> and text<=>binary conversions (especially for integers, reals
> and anything stored as an integer) also can be CPU-intensive.
>
> -JimC

Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
>>>>> "MG" == Mladen Gogala <mladen.gogala@vmsinfo.com> writes: MG> Good time accounting is the most compelling reason for having a wait MG> event interface, like Oracle. Without the wait event interface, one MG> cannot really tell where the time is spent, at least not without MG> profiling the database code, which is not an option for a production MG> database. And how exactly, given that the kernel does not know whether the CPU is active or waiting on ram, could an application do so? -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
On Mon, Dec 20, 2010 at 10:33:26AM -0500, James Cloos wrote:
> >>>>> "MG" == Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
>
> MG> Good time accounting is the most compelling reason for having a wait
> MG> event interface, like Oracle. Without the wait event interface, one
> MG> cannot really tell where the time is spent, at least not without
> MG> profiling the database code, which is not an option for a production
> MG> database.
>
> And how exactly, given that the kernel does not know whether the CPU is
> active or waiting on ram, could an application do so?
>

Exactly. I have only seen this data from hardware emulators. It would be nice to have... :)

Ken
On 2010-12-20 15:48, Kenneth Marshall wrote:
>> And how exactly, given that the kernel does not know whether the CPU is
>> active or waiting on ram, could an application do so?
>>
>
> Exactly. I have only seen this data from hardware emulators. It would
> be nice to have... :)

There's no reason that the CPU hardware couldn't gather such data, and IMHO it'd be dead useful, at least at the outermost cache level (preferably separately at each level). But people have trouble understanding vmstat already....

Note that dtrace *can* get to the CPU performance counters; it's just that the kernel doesn't routinely account for all that info per-process. I'd expect IBM to have equivalent facilities.

--
Jeremy
On 12/20/2010 10:33 AM, James Cloos wrote:
>
> And how exactly, given that the kernel does not know whether the CPU is
> active or waiting on ram, could an application do so?
>
> -JimC

That particular aspect will remain hidden; it's the domain of the hardware architecture. Nevertheless, there are things like waiting on I/O or waiting on a lock that would be extremely useful.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
On Dec 20, 2010, at 12:47 AM, Mladen Gogala wrote:
> Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.

Out of curiosity, have you tried using the information that Postgres exposes to dtrace? I suspect it comes close to what you can get directly out of Oracle...

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
Jim Nasby wrote:
> On Dec 20, 2010, at 12:47 AM, Mladen Gogala wrote:
>
>> Good time accounting is the most compelling reason for having a wait event interface, like Oracle. Without the wait event interface, one cannot really tell where the time is spent, at least not without profiling the database code, which is not an option for a production database.
>>
>
> Out of curiosity, have you tried using the information that Postgres exposes to dtrace? I suspect it comes close to what you can get directly out of Oracle...
> --
> Jim C. Nasby, Database Architect                   jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net

No, I haven't, but I looked it up in the documentation. I surmise, however, that I will have to build the software with "--enable-dtrace", which is not enabled by default. This certainly looks promising.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
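A rough sketch of what that looks like in practice -- the PID below is a placeholder, and the set of probes exposed varies by PostgreSQL version, so check the documentation for the build you're running:

    ./configure --enable-dtrace ...               # rebuild with the probe points compiled in
    # list the probes a running backend exposes (replace 12345 with a backend PID)
    dtrace -l -n 'postgresql$target:::' -p 12345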
Has anyone had a chance to recompile and try a larger blocksize than 8192 with PostgreSQL 8.4.x? I'm finally getting around to tuning some FusionIO drives that we are setting up. We are looking to set up 4 FusionIO drives per server, and then use pgpooler to scale them to 3 servers so that we can scale up to 72 processors. I'm almost done with the configuration to start heavily testing, but wanted to know if anyone has really messed with the blocksize options.

If anyone has done load balancing with pgpooler I'd love to hear their experience with it as well.

I have attached the randwrite performance test below; as you can see, going from 8K -> 32K -> 1M blocksize the drives really start to move.

Thanks,
- John

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=8k --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/138M /s] [0/17K iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=23287
  write: io=3,819MB, bw=127MB/s, iops=16,292, runt= 30001msec
    clat (usec): min=41, max=1,835K, avg=268.42, stdev=3627.29
    bw (KB/s) : min=    4, max=142304, per=15.13%, avg=19714.31, stdev=8364.40
  cpu       : usr=0.16%, sys=3.13%, ctx=1123544, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/488779, short=0/0
     lat (usec): 50=14.81%, 100=58.17%, 250=0.28%, 500=22.33%, 750=3.67%
     lat (usec): 1000=0.16%
     lat (msec): 2=0.35%, 4=0.09%, 10=0.11%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2000=0.01%

Run status group 0 (all jobs):
  WRITE: io=3,819MB, aggrb=127MB/s, minb=130MB/s, maxb=130MB/s, mint=30001msec, maxt=30001msec

Disk stats (read/write):
  md0: ios=0/514993, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=32k --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/343M /s] [0/11K iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=23835
  write: io=9,833MB, bw=328MB/s, iops=10,487, runt= 30002msec
    clat (usec): min=64, max=227K, avg=349.31, stdev=1517.64
    bw (KB/s) : min=  883, max=171712, per=16.25%, avg=54548.49, stdev=13973.76
  cpu       : usr=0.18%, sys=2.82%, ctx=669138, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/314659, short=0/0
     lat (usec): 100=84.14%, 250=8.73%, 500=0.07%, 750=2.45%, 1000=3.19%
     lat (msec): 2=0.29%, 4=0.17%, 10=0.22%, 20=0.23%, 50=0.31%
     lat (msec): 100=0.13%, 250=0.05%

Run status group 0 (all jobs):
  WRITE: io=9,833MB, aggrb=328MB/s, minb=336MB/s, maxb=336MB/s, mint=30002msec, maxt=30002msec

Disk stats (read/write):
  md0: ios=0/455522, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%

[v025554@athenaash05 /]$ fio --filename=/fusionIO/export1/test1 --size=25G --bs=1M --direct=1 --rw=randwrite --numjobs=8 --runtime=30 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
...
file1: (g=0): rw=randwrite, bs=1M-1M/1M-1M, ioengine=sync, iodepth=1
Starting 8 processes
Jobs: 8 (f=8): [wwwwwwww] [100.0% done] [0K/912M /s] [0/890 iops] [eta 00m:00s]
file1: (groupid=0, jobs=8): err= 0: pid=24877
  write: io=25,860MB, bw=862MB/s, iops=861, runt= 30004msec
    clat (usec): min=456, max=83,766, avg=5599.02, stdev=2026.74
    bw (KB/s) : min=28603, max=216966, per=11.93%, avg=105311.22, stdev=10668.06
  cpu       : usr=0.06%, sys=2.03%, ctx=91888, majf=0, minf=176
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/25860, short=0/0
     lat (usec): 500=12.74%, 750=20.60%, 1000=7.37%
     lat (msec): 2=3.12%, 4=6.57%, 10=26.95%, 20=21.37%, 50=1.12%
     lat (msec): 100=0.14%

Run status group 0 (all jobs):
  WRITE: io=25,860MB, aggrb=862MB/s, minb=883MB/s, maxb=883MB/s, mint=30004msec, maxt=30004msec

Disk stats (read/write):
  md0: ios=0/500382, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    fiod: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioc: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fiob: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
    fioa: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=nan%
Strange, John W wrote:
> Has anyone had a chance to recompile and try a larger blocksize than 8192 with PostgreSQL 8.4.x?

While I haven't done the actual experiment you're asking about, the problem working against you here is how WAL data is used to protect against partial database writes. See the documentation for full_page_writes at http://www.postgresql.org/docs/current/static/runtime-config-wal.html

Because full-size copies of the blocks have to get written there, attempts to chunk writes into larger pieces end up requiring a correspondingly larger volume of writes to protect against partial writes to those pages. You might get a nice efficiency gain on the read side, but the situation when under a heavy write load (the main thing you have to be careful about with these SSDs) is much less clear.

I wouldn't draw any conclusions whatsoever from what fio says about this; it's pretty useless IMHO for simulating anything like a real database workload. I don't even use that utility anymore, as I found it just wasted my time compared with moving straight onto something that tries to act like a database application simulation. You might instead try running pgbench with the database scale set large enough that the resulting database is large relative to total system RAM.

P.S. Make sure you put the FusionIO drives under a heavy write load and power down the server hard, so you can see what happens if there's a real-world crash. Recovery time to remount the drives in that situation is the main drawback of their design. It does the right thing to protect your data as far as I know, but the recovery can be quite time intensive.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
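For reference, the block size is a compile-time option and requires initdb'ing a fresh cluster; a sketch of the kind of test Greg describes, where the scale factor and client counts are only examples -- pick a scale that makes the database comfortably larger than RAM:

    ./configure --with-blocksize=32 ...    # block size in kB; rebuild, then initdb a new cluster
    pgbench -i -s 4000 mydb                # scale 4000 is roughly 60GB of data
    pgbench -c 16 -j 4 -T 600 mydb         # sustained TPC-B-like load for 10 minutes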
On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Strange, John W wrote:
>>
>> Has anyone had a chance to recompile and try a larger blocksize
>> than 8192 with PostgreSQL 8.4.x?
>
> While I haven't done the actual experiment you're asking about, the problem
> working against you here is how WAL data is used to protect against partial
> database writes. See the documentation for full_page_writes at
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> Because full size copies of the blocks have to get written there, attempts
> to chunk writes into larger pieces end up requiring a correspondingly larger
> volume of writes to protect against partial writes to those pages. You
> might get a nice efficiency gain on the read side, but the situation when
> under a heavy write load (the main thing you have to be careful about with
> these SSDs) is much less clear.

most flash drives, especially mlc flash, use huge blocks anyways on a physical level. the numbers claimed here (http://www.fusionio.com/products/iodrive/) (141k write iops) are simply not believable without write buffering. i didn't see any note of how fault tolerance is maintained through the buffer (anyone know?).

assuming they do buffer, i would expect a smaller blocksize to be better/easier on the ssd, since this would mean less gross writing, higher maximum throughput, and less wear and tear on the flash; the advantages of the larger blocksize are very hardware driven and already managed by the controller. if they don't buffer (again, i'm very skeptical this is the case), a larger block size, possibly even a much larger block size (like 256k), would be an interesting test.

i'm pretty skeptical about the fusion i/o product generally, because i don't think the sata interface is a bottleneck save for read caching, and the o/s is already buffering reads. the storage medium is still the bottleneck for the most part (even if it's much faster at certain things). note fusion is still useful for some things, but the niche is narrower than it looks on the surface.

merlin
On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

FusionIO buffers. They have capacitors onboard to protect against crashing and power failure. They passed our crash attempts to corrupt writes to them before we put them into production, for whatever that's worth, but they do take a long time to come back online after an unclean shutdown.
This has gotten a lot better with the 2.x drivers as well.

I'm completely aware of the FusionIO and its advantages/disadvantages. I'm currently getting the following pgbench results but still only hitting the array for about 800MB/sec, short of the 3GB/sec that it's capable of. This is simply a trash DB for us to store results in for short periods of time. If something bad was to happen we can regenerate the results. So performance with limited risk is what we are looking to achieve.

asgprod@ASH01_riskresults $ pgbench -v -j 4 -t 200000 -c 16 -h localhost -p 4410 pgbench_10000
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 10000
query mode: simple
number of clients: 16
number of threads: 4
number of transactions per client: 200000
number of transactions actually processed: 3200000/3200000
tps = 16783.841042 (including connections establishing)
tps = 16785.592722 (excluding connections establishing)

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Ben Chobot
Sent: Tuesday, January 04, 2011 12:37 PM
To: Merlin Moncure
Cc: pgsql-performance@postgresql.org Performance
Subject: Re: [PERFORM] Question: BlockSize > 8192 with FusionIO

On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

FusionIO buffers. They have capacitors onboard to protect against crashing and power failure. They passed our crash attempts to corrupt writes to them before we put them into production, for whatever that's worth, but they do take a long time to come back online after an unclean shutdown.
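One quick way to tell whether a run like that is disk-limited or CPU-limited is to watch the devices and the CPUs at the same time while pgbench runs; a sketch, assuming the sysstat tools are installed (watch the line for whatever device the FusionIO array shows up as, e.g. md0 here):

    iostat -xm 1      # per-device MB/s written and utilisation, sampled every second
    vmstat 1          # user/system CPU time vs. idle and iowait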
Test - sorry, trying to fix why my email is getting formatted to bits when posting to the list.

- John
On Jan 4, 2011, at 8:48 AM, Merlin Moncure wrote:
> On Mon, Jan 3, 2011 at 9:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Strange, John W wrote:
>>>
>>> Has anyone had a chance to recompile and try a larger blocksize
>>> than 8192 with PostgreSQL 8.4.x?
>>
>> While I haven't done the actual experiment you're asking about, the problem
>> working against you here is how WAL data is used to protect against partial
>> database writes. See the documentation for full_page_writes at
>> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>> Because full size copies of the blocks have to get written there, attempts
>> to chunk writes into larger pieces end up requiring a correspondingly larger
>> volume of writes to protect against partial writes to those pages. You
>> might get a nice efficiency gain on the read side, but the situation when
>> under a heavy write load (the main thing you have to be careful about with
>> these SSDs) is much less clear.
>
> most flash drives, especially mlc flash, use huge blocks anyways on
> physical level. the numbers claimed here
> (http://www.fusionio.com/products/iodrive/) (141k write iops) are
> simply not believable without write buffering. i didn't see any note
> of how fault tolerance is maintained through the buffer (anyone
> know?).

Flash may have very large erase blocks -- 4k to 16M -- but you can write to it at much smaller block sizes sequentially. It has to erase a block in bulk, but it can write to an erased block bit by bit, sequentially (512 or 4096 bytes typically, but some is 8k and 16k).

Older MLC NAND flash could be written a couple of bytes at a time -- but drives today incorporate too much ECC and use larger chunks to do that. The minimum write size now is caused by the ECC requirements and not the physical NAND flash requirements.

So, buffering isn't that big of a requirement with the current LBA -> physical translations, which change all writes -- random or not -- to sequential writes in one erase block.

But performance when waiting for the write to complete will not be all that good, especially with MLC. Turn off the buffer on an Intel SLC drive, for example, and write IOPS is cut by 1/3 or more -- to 'only' 1000 or so IOPS.