Performance increase with elevator=deadline

From
"Albe Laurenz"
Date:
This refers to the performance problem reported in
http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php

After some time of trial and error we found that changing the I/O scheduling
algorithm to "deadline" improved I/O performance by a factor of 4 (!) for
this specific load test.

It seems that the bottleneck in this case was actually in the Linux kernel.

Since performance statements are useless without a description of
the system and the type of load, I'll send a few details to make this
report more useful for the archives:

The machine is a PC with 8 AMD Opteron 885 CPUs and 32 GB RAM, attached to
a HP EVA 8100 storage system with 72 disks.

We are running 64-bit Linux 2.6.18-53.1.6.el5 from Red Hat Enterprise Linux 5.1.
The I/O queue depth is set to 64.

Our benchmark tools show a possible I/O performance of about 11000 transactions
per second for random scattered reads of 8k blocks.


PostgreSQL version is 8.2.4.

The database system is a cluster with 6 databases containing tables up
to a couple of GB in size. The whole database cluster takes about
200 GB of storage.

The database load is a set of read-only statements, several of which have
miserable execution plans and perform table and index scans.


With the default I/O scheduler we observe a performance of about
600 I/O transactions per second, or 7 MB per second.

After booting with elevator=deadline, both values increase by a factor
of up to 4 and the query response times drop drastically.

Yours,
Laurenz Albe

Re: Performance increase with elevator=deadline

From
Gregory Stark
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:

> This refers to the performance problem reported in
> http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php
>
> After some time of trial and error we found that changing the I/O scheduling
> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
> this specific load test.

What was the algorithm before?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Performance increase with elevator=deadline

From
Jeff
Date:
On Apr 11, 2008, at 7:22 AM, Albe Laurenz wrote:
> After some time of trial and error we found that changing the I/O
> scheduling
> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
> this specific load test.
>
I was inspired once again to look into this - as I'm recently hitting
some glass ceilings with my machines.

I have a little app I wrote called pgiosim (it's on pgfoundry - http://pgfoundry.org/projects/pgiosim)
that basically just opens some files and does random seeks and 8kB
reads, much like what our beloved PG does.
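In the same spirit, here is a crude shell approximation of that access pattern (a sketch only: the scratch file name and sizes are made up, and bash's $RANDOM stands in for a proper PRNG; pgiosim itself is a small C program):

```shell
# Create an 8 MB scratch file, then do a burst of random 8 kB reads
# from it -- roughly the access pattern of index-driven heap fetches.
dd if=/dev/zero of=scratch.dat bs=8k count=1024 2>/dev/null
blocks=1024
reads=100
i=1
while [ "$i" -le "$reads" ]; do
    # read exactly one randomly chosen 8 kB block
    dd if=scratch.dat bs=8k skip=$((RANDOM % blocks)) count=1 \
        of=/dev/null 2>/dev/null
    i=$((i + 1))
done
echo "done: $reads random 8k reads"
rm -f scratch.dat
```

On a file this small everything is cached after the first pass; the scheduler effects Jeff measured only show up when the data set is several times larger than RAM (30GB against 8GB here).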

Using 4 of these with a dataset of about 30GB across a few files
(the machine has 8GB mem), I went from around 100 io/sec to 330 after
changing to noop.   Quite an improvement.  If you have a decent
controller, CFQ is not what you want.   I tried deadline as well and it
was a touch slower.  The controller is a 3ware 9550sx with 4 disks in a raid10.

I'll be trying this out on the big array later today.  I found it
surprising that this info (don't use CFQ on a good controller) isn't
more widespread.

It also seems changing elevators on the fly works fine (echo
schedulername > /sys/block/.../queue/scheduler).  I admit I sat there
flipping back and forth going "disk go fast.. disk go slow.. disk go
fast... " :)
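For the record, the same sysfs file also shows which scheduler is currently active (the bracketed entry); a small sketch, with the usual caveats that device paths vary and the write needs root:

```shell
# Print each block device's active I/O scheduler.  The sysfs file lists
# all available schedulers with the active one bracketed, e.g.
# "noop anticipatory deadline [cfq]".
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(sed 's/.*\[\(.*\)\].*/\1/' "$f")"
done
# and to switch at runtime (as root):
#   echo deadline > /sys/block/sda/queue/scheduler
```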

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: Performance increase with elevator=deadline

From
Matthew
Date:
On Fri, 11 Apr 2008, Jeff wrote:
> Using 4 of these with a dataset of about 30GB across a few files (Machine has
> 8GB mem) I went from around 100 io/sec to 330 changing to noop.   Quite an
> improvement.  If you have a decent controller CFQ is not what you want.   I
> tried deadline as well and it was a touch slower.  The controller is a 3ware
> 9550sx with 4 disks in a raid10.

I ran Greg's fadvise test program a while back on a 12-disc array. The
three schedulers (deadline, noop, anticipatory) all performed pretty-much
the same, with the fourth (cfq, the default) being consistently slower.

> it also seems changing elevators on the fly works fine (echo schedulername >
> /sys/block/.../queue/scheduler  I admit I sat there flipping back and forth
> going "disk go fast.. disk go slow.. disk go fast... " :)

Oh Homer Simpson, your legacy lives on.

Matthew

--
I suppose some of you have done a Continuous Maths course. Yes? Continuous
Maths? <menacing stares from audience> Whoah, it was like that, was it!
                                        -- Computer Science Lecturer

Re: Performance increase with elevator=deadline

From
Craig Ringer
Date:
Matthew wrote:
> On Fri, 11 Apr 2008, Jeff wrote:
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop.   Quite an improvement.  If you have a decent controller CFQ is
>> not what you want.   I tried deadline as well and it was a touch
>> slower.  The controller is a 3ware 9550sx with 4 disks in a raid10.
>
> I ran Greg's fadvise test program a while back on a 12-disc array. The
> three schedulers (deadline, noop, anticipatory) all performed
> pretty-much the same, with the fourth (cfq, the default) being
> consistently slower.

I use CFQ on some of my servers, despite the fact that it's often slower
in total throughput terms, because it delivers much more predictable I/O
latencies that help prevent light I/O processes being starved by heavy
I/O processes. In particular, a Linux terminal server used at work took
a lot of I/O tuning before it delivered even faintly acceptable I/O
latencies under any sort of load.

Bounded I/O latency at the expense of throughput is not what you usually
want on a DB server, where throughput is king, so I'm not at all
surprised that CFQ performs poorly for PostgreSQL. I've done no testing
on that myself, though, because with my DB size and the nature of my
queries most of them are CPU bound anyway.

Speaking of I/O performance with PostgreSQL, has anybody here done any
testing to compare results with LVM to results with the same filesystem
on a conventionally partitioned or raw volume? I'd probably use LVM even
at a performance cost because of its admin benefits, but I'd be curious
if there is any known cost for use with Pg. I've never been able to
measure one with other workloads.

--
Craig Ringer

Re: Performance increase with elevator=deadline

From
Greg Smith
Date:
On Sat, 12 Apr 2008, Craig Ringer wrote:

> Speaking of I/O performance with PostgreSQL, has anybody here done any
> testing to compare results with LVM to results with the same filesystem on a
> conventionally partitioned or raw volume?

There was some chatter on this topic last year; a quick search finds

http://archives.postgresql.org/pgsql-performance/2007-06/msg00005.php

which is a fair statement of the situation.  I don't recall any specific
benchmarks.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Performance increase with elevator=deadline

From
Florian Weimer
Date:
* Jeff:

> Using 4 of these with a dataset of about 30GB across a few files
> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
> noop.   Quite an improvement.  If you have a decent controller CFQ is
> not what you want.   I tried deadline as well and it was a touch
> slower.  The controller is a 3ware 9550sx with 4 disks in a raid10.
>
> I'll be trying this out on the big array later today.  I found it
> surprising this info wasn't more widespread (the use of CFQ on a good
> controller).

3ware might be a bit special because the controller has got very deep
queues of its own, so many assumptions of the kernel I/O schedulers do
not seem to apply.  Toying with the kernel/controller queue depths
might help, but I haven't done real benchmarks to see whether it
actually makes a difference.

A few days ago, I experienced this: On a machine with a 3ware
controller, a simple getxattr call on a file in an uncontended
directory took several minutes because a PostgreSQL dump process was
running in the background (and some other activity of a legacy
database which caused frequent fdatasync calls).  Clearly, this is
unacceptable, and I've since switched to the deadline scheduler, too.
So far, this particular behavior hasn't occurred again.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: Performance increase with elevator=deadline

From
david@lang.hm
Date:
On Tue, 15 Apr 2008, Florian Weimer wrote:

> * Jeff:
>
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop.   Quite an improvement.  If you have a decent controller CFQ is
>> not what you want.   I tried deadline as well and it was a touch
>> slower.  The controller is a 3ware 9550sx with 4 disks in a raid10.
>>
>> I'll be trying this out on the big array later today.  I found it
>> surprising this info wasn't more widespread (the use of CFQ on a good
>> controller).
>
> 3ware might be a bit special because the controller has got very deep
> queues on its own, so many assumptions of the kernel I/O schedulers do
> not seem to apply.  Toying with the kernel/controller queue depths
> might help, but I haven't done real benchmarks to see if it's actually
> a difference.
>
> A few days ago, I experienced this: On a machine with a 3ware
> controller, a simple getxattr call on a file in an uncontended
> directory took several minutes because a PostgreSQL dump process was
> running in the background (and some other activity of a legacy
> database which caused frequent fdatasync calls).  Clearly, this is
> unacceptable, and I've since switched to the deadline scheduler, too.
> So far, this particular behavior hasn't occurred again.

One other thing to watch out for: up until very recent kernels (2.6.23 or
2.6.24) it was possible for one very busy block device to starve other
block devices. They added isolation of queues for different block devices,
but I've seen reports that the isolation can end up throttling
high-performance devices as a result. I haven't had time to really dig
into this, but there are tuning knobs available to adjust the queue space
available to different devices, and the reports show significantly better
activity on a tuned system.

David Lang

Re: Performance increase with elevator=deadline

From
"Albe Laurenz"
Date:
Gregory Stark wrote:
>> After some time of trial and error we found that changing the I/O scheduling
>> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
>> this specific load test.
>
> What was the algorithm before?

The default algorithm, CFQ I think it is.

Yours,
Laurenz Albe

Re: Performance increase with elevator=deadline

From
Enrico Sirola
Date:
Hi,

Il giorno 11/apr/08, alle ore 20:03, Craig Ringer ha scritto:
>
> Speaking of I/O performance with PostgreSQL, has anybody here done
> any testing to compare results with LVM to results with the same
> filesystem on a conventionally partitioned or raw volume? I'd
> probably use LVM even at a performance cost because of its admin
> benefits, but I'd be curious if there is any known cost for use with
> Pg. I've never been able to measure one with other workloads.

I performed some tests some time ago. LVM is significantly slower.
The disk subsystem is a HP P400/512MB battery-backed controller with 4
disks in raid 10.
See the tests:


ext3 tests:

bonnie++ -s 16000 -u 0 -f -b
========================================================================

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
              16000M           153637  50 78895  17           204124  17 700.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2233  10 +++++ +++  2606   8  2255  10 +++++ +++  2584   7
16000M,,,153637,50,78895,17,,,204124,17,700.6,1,16,2233,10,+++++,+++,2606,8,2255,10,+++++,+++,2584,7


bonnie++ -s 16000 -u 0 -f
========================================================================

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
              16000M           162223  51 77277  17           207055  17 765.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,162223,51,77277,17,,,207055,17,765.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

========================================================================

LVM tests:

bonnie++ -u 0 -f -s 16000 -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
              16000M           153768  52 53420  13           177414  15 699.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2158   9 +++++ +++  2490   7  2177   9 +++++ +++  2460   7
16000M,,,153768,52,53420,13,,,177414,15,699.8,1,16,2158,9,+++++,+++,2490,7,2177,9,+++++,+++,2460,7

bonnie++ -u 0 -f -s 16000
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
              16000M           161476  53 54904  13           171693  14 774.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,161476,53,54904,13,,,171693,14,774.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Bye,
e.


RAID 10 Benchmark with different I/O schedulers

From
Craig James
Date:
I had the opportunity to do more testing on another new server to see whether the kernel's I/O scheduling makes any
difference. Conclusion: On a battery-backed RAID 10 system, the kernel's I/O scheduling algorithm has no effect.  This
makes sense, since a battery-backed cache will supersede any I/O rescheduling that the kernel tries to do.

Hardware:
 Dell 2950
 8 CPU (Intel 2GHz Xeon)
 8 GB memory
 Dell Perc 6i with battery-backed cache
 RAID 10 of 8x 146GB SAS 10K 2.5" disks

Software:
 Linux 2.6.24, 64-bit
    XFS file system
 Postgres 8.3.0
    max_connections = 1000
    shared_buffers = 2000MB
    work_mem = 256MB
    max_fsm_pages = 1000000
    max_fsm_relations = 5000
    synchronous_commit = off
    wal_buffers = 256kB
    checkpoint_segments = 30
    effective_cache_size = 4GB

Each test was run 5 times:
  drop database test
  create database test
  pgbench -i -s 20 -U test
  pgbench -c 10 -t 50000 -v -U test

The I/O scheduler was changed on-the-fly using (for example) "echo cfq >/sys/block/sda/queue/scheduler".
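For the archives, the whole sequence can be scripted; a sketch (hypothetical: the device name, the "test" user/database, and the output parsing are my own choices, and the sysfs write needs root):

```shell
# Run the pgbench sequence once per scheduler and extract the tps figure.
for sched in cfq noop deadline anticipatory; do
    echo "$sched" > /sys/block/sda/queue/scheduler
    dropdb test 2>/dev/null
    createdb test
    pgbench -i -s 20 -U test test >/dev/null 2>&1
    tps=$(pgbench -c 10 -t 50000 -v -U test test 2>/dev/null |
          sed -n 's/^tps = \([0-9.]*\) (including.*/\1/p')
    echo "$sched: $tps tps"
done
```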

Autovacuum was turned off during the test.

Here are the results.  The numbers are those reported as "tps = xxxx (including connections establishing)" (which were
almost identical to the "excluding..." tps numbers).

I/O Sched     AVG     Test1  Test2  Test3  Test4  Test5
---------    -----    -----  -----  -----  -----  -----
cfq           3355     3646   3207   3132   3204   3584
noop          3163     2901   3190   3293   3124   3308
deadline      3547     3923   3722   3351   3484   3254
anticipatory  3384     3453   3916   2944   3451   3156

As you can see, the averages are very close -- closer than the "noise" between runs.  As far as I can tell, there is no
significant advantage, or even any significant difference, between the various I/O scheduler algorithms.

(It also reinforces what the pgbench man page says: Short runs aren't useful.  Even these two-minute runs have a lot of
variability. Before I turned off AutoVacuum, the variability was more like 50% between runs.) 
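Those averages are easy to recompute from the raw columns; for example, with a quick awk pass over the table data (re-keyed by hand here):

```shell
# Average the five per-run tps columns for each scheduler.
awk '{ s = 0; for (i = 2; i <= 6; i++) s += $i
       printf "%s %.1f\n", $1, s / 5 }' <<'EOF'
cfq          3646 3207 3132 3204 3584
noop         2901 3190 3293 3124 3308
deadline     3923 3722 3351 3484 3254
anticipatory 3453 3916 2944 3451 3156
EOF
```

which reproduces the rounded AVG column above (3354.6, 3163.2, 3546.8, 3384.0).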

Craig

On Mon, May 5, 2008 at 5:33 PM, Craig James <craig_james@emolecules.com> wrote:
>
>  (It also reinforces what the pgbench man page says: Short runs aren't
> useful.  Even these two-minute runs have a lot of variability.  Before I
> turned off AutoVacuum, the variability was more like 50% between runs.)

I'd suggest a couple things for more realistic tests.  Run the tests
much longer, say 30 minutes to an hour.  Crank up your scaling factor
until your test db is larger than memory.  Turn on autovacuum, maybe
raising the cost / delay factors so it doesn't affect performance too
negatively.  And lastly tuning the bgwriter so that checkpoints are
short and don't interfere too much.

My guess is if you let it run for a while, you'll get a much more
reliable number.

On Mon, 5 May 2008, Craig James wrote:

> pgbench -i -s 20 -U test

That's way too low to expect you'll see a difference in I/O schedulers.
A scale of 20 is giving you a 320MB database, you can fit the whole thing
in RAM and almost all of it on your controller cache.  What's there to
schedule?  You're just moving between buffers that are generally large
enough to hold most of what they need.

> pgbench -c 10 -t 50000 -v -U test

This is OK, because when you increase the size you're not going to be
pushing 3500 TPS anymore and this test will take quite a while.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On May 5, 2008, at 7:33 PM, Craig James wrote:

> I had the opportunity to do more testing on another new server to
> see whether the kernel's I/O scheduling makes any difference.
> Conclusion: On a battery-backed RAID 10 system, the kernel's I/O
> scheduling algorithm has no effect.  This makes sense, since a
> battery-backed cache will supercede any I/O rescheduling that the
> kernel tries to do.
>

this goes against my real world experience here.

> pgbench -i -s 20 -U test
> pgbench -c 10 -t 50000 -v -U test
>

You should use a sample size of 2x ram to get a more realistic number,
or try out my pgiosim tool on pgfoundry which "sort of" simulates an
index scan.  I posted numbers from that a month or two ago here.


--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: RAID 10 Benchmark with different I/O schedulers

From
Craig James
Date:
Greg Smith wrote:
> On Mon, 5 May 2008, Craig James wrote:
>
>> pgbench -i -s 20 -U test
>
> That's way too low to expect you'll see a difference in I/O schedulers.
> A scale of 20 is giving you a 320MB database, you can fit the whole
> thing in RAM and almost all of it on your controller cache.  What's
> there to schedule?  You're just moving between buffers that are
> generally large enough to hold most of what they need.

Test repeated with:
autovacuum enabled
database destroyed and recreated between runs
pgbench -i -s 600 ...
pgbench -c 10 -t 50000 -n ...

I/O Sched     AVG     Test1  Test2
---------    -----    -----  -----
cfq            705      695    715
noop           758      769    747
deadline       741      705    775
anticipatory   494      477    511

I only did two runs of each, which took about 24 minutes.  Like the first round of tests, the "noise" in the
measurements (about 10%) exceeds the difference between scheduler-algorithm performance, except that "anticipatory"
seems to be measurably slower.

So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID.

Craig

Re: RAID 10 Benchmark with different I/O schedulers

From
Greg Smith
Date:
On Tue, 6 May 2008, Craig James wrote:

> I only did two runs of each, which took about 24 minutes.  Like the first
> round of tests, the "noise" in the measurements (about 10%) exceeds the
> difference between scheduler-algorithm performance, except that
> "anticipatory" seems to be measurably slower.

Those are much better results.  Any test that says anticipatory is
anything other than useless for database system use with a good controller
I presume is broken, so that's how I know you're in the right ballpark now
but weren't before.

In order to actually get some useful data out of the noise that is
pgbench, you need a lot more measurements of longer runs.  As perspective,
the last time I did something in this area, in order to get enough data to
get a clear picture I ran tests for 12 hours.  I'm hoping to repeat that
soon with some more common hardware that gives useful results I can give
out.

> So it still looks like cfq, noop and deadline are more or less equivalent
> when used with a battery-backed RAID.

I think it's fair to say they're within 10% of one another on raw
throughput.  The thing you're not measuring here is worst-case latency,
and that's where there might be a more interesting difference.  Most tests
I've seen suggest deadline is the best in that regard, cfq the worst, and
where noop fits in depends on the underlying controller.

pgbench produces log files with latency measurements if you pass it "-l".
Here's a snippet of shell that runs pgbench then looks at the resulting
latency results for the worst 5 numbers:

pgbench ... -l &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
echo Worst latency results:
cut -f 3 -d " " pgbench.log | sort -n | tail -n 5

However, that may not give you much useful info either--in most cases
checkpoint issues kind of swamp the worst-case behavior in PostgreSQL,
and to quantify I/O schedulers you need to look at more complicated
latency statistics.
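As one concrete example of such a statistic, a rough 95th-percentile latency can be pulled out of the same log (a sketch, assuming the 8.x log format where the third space-separated field is the per-transaction latency in microseconds):

```shell
# Rough 95th-percentile transaction latency (microseconds) from a
# pgbench -l log whose third field is the per-transaction latency.
sort -n -k 3 pgbench.log |
    awk '{ lat[NR] = $3 } END { print lat[int(NR * 0.95)] }'
```

A latency histogram or a latency-over-time plot tells you even more, since scheduler differences tend to show up as occasional very slow transactions rather than a shifted average.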

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: RAID 10 Benchmark with different I/O schedulers

From
Craig James
Date:
Greg Smith wrote:
> On Tue, 6 May 2008, Craig James wrote:
>
>> I only did two runs of each, which took about 24 minutes.  Like the
>> first round of tests, the "noise" in the measurements (about 10%)
>> exceeds the difference between scheduler-algorithm performance, except
>> that "anticipatory" seems to be measurably slower.
>
> Those are much better results.  Any test that says anticipatory is
> anything other than useless for database system use with a good
> controller I presume is broken, so that's how I know you're in the right
> ballpark now but weren't before.
>
> In order to actually get some useful data out of the noise that is
> pgbench, you need a lot more measurements of longer runs.  As
> perspective, the last time I did something in this area, in order to get
> enough data to get a clear picture I ran tests for 12 hours.  I'm hoping
> to repeat that soon with some more common hardware that gives useful
> results I can give out.

This data is good enough for what I'm doing.  There were reports from non-RAID users that the I/O scheduling could make
as much as a 4x difference in performance (which makes sense for non-RAID), but these tests show me that three of the
four I/O schedulers are within 10% of each other.  Since this matches my intuition of how battery-backed RAID will work,
I'm satisfied.  If our servers get overloaded to the point where 10% matters, then I need a much more dramatic solution,
like faster machines or more machines.

Craig


Re: RAID 10 Benchmark with different I/O schedulers

From
"Albe Laurenz *EXTERN*"
Date:
Craig James wrote:
> This data is good enough for what I'm doing.  There were
> reports from non-RAID users that the I/O scheduling could
> make as much as a 4x difference in performance (which makes
> sense for non-RAID), but these tests show me that three of
> the four I/O schedulers are within 10% of each other.  Since
> this matches my intuition of how battery-backed RAID will
> work, I'm satisfied.  If our servers get overloaded to the
> point where 10% matters, then I need a much more dramatic
> solution, like faster machines or more machines.

I should comment on this as I am the one who reported the
big performance increase with the deadline scheduler.
I was very surprised at this increase myself as I had not seen
any similar reports, so I thought I should share it for whatever
it is worth.

Our SAN *is* a RAID-5 with lots of cache, so there must be a flaw
in your intuition.

Performance measures depend a lot on your hardware and
software setup (e.g. kernel version in this case) and on the
specific load. The load we used was a real-life load, collected
over several hours and extracted from the log files.

My opinion is that performance observations can rarely be
generalized - I am not surprised that with a different system
and a different load you observe hardly any difference between
"cfq" and "deadline".

For the record, in our test case "noop" performed practically
as well as "deadline", while the other two did way worse.

Like yourself, I have wondered why different I/O scheduling
algorithms should make so much difference.
Here is my home-spun theory of what may happen; tear it apart
and replace it with a better one at your convenience:

Our SAN probably (we're investigating) has its own brains to
optimize I/O, and I guess that any optimization that the kernel
does can only deteriorate performance because the two algorithms
might "step on each other's toes". This is backed by "noop"
performing well.
I believe that caching will not make much difference, because the
cache is way smaller than the database, and whatever is neither in
the shared buffer nor in the kernel filesystem cache is also not
likely to be in the storage system's cache. Remember that our load
was read-only.

Yours,
Laurenz Albe

Re: RAID 10 Benchmark with different I/O schedulers

From
Matthew Wakeling
Date:
On Tue, 6 May 2008, Craig James wrote:
> I/O Sched     AVG     Test1  Test2
> ---------    -----    -----  -----
> cfq            705      695    715
> noop           758      769    747
> deadline       741      705    775
> anticipatory   494      477    511

Interesting. That contrasts with some tests I did a while back on a
16-disc RAID-0, where noop, deadline, and anticipatory were all identical
in performance, with cfq being significantly slower. Admittedly, the disc
test was single-process, which is probably why the anticipatory behaviour
didn't kick in. You are seeing a little bit of degradation with cfq - I
guess it's worse the bigger the disc subsystem you have.

Matthew

--
Matthew: That's one of things about Cambridge - all the roads keep changing
          names as you walk along them, like Hills Road in particular.
Sagar:   Yes, Sidney Street is a bit like that too.
Matthew: Sidney Street *is* Hills Road.

Typecast bug?

From
Craig James
Date:
This seems like a bug to me, but it shows up as a performance problem.  Since the column being queried is an integer,
the second query (see below) can't possibly match, yet Postgres uses a typecast, forcing a full table scan for a value
that can't possibly be in the table.

The application could intercept these bogus queries, but that requires building schema-specific and postgres-specific
knowledge into the application (i.e. "What is the maximum legal integer for this column?").

Craig


explain analyze select version_id, parent_id from version where version_id = 99999;
                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using version_pkey on version  (cost=0.00..9.89 rows=1 width=8) (actual time=0.054..0.054 rows=0 loops=1)
  Index Cond: (version_id = 99999)
Total runtime: 0.130 ms
(3 rows)

emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id =
999999999999999999999999999;
                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on version  (cost=0.00..253431.77 rows=48393 width=8) (actual time=3135.530..3135.530 rows=0 loops=1)
  Filter: ((version_id)::numeric = 999999999999999999999999999::numeric)
Total runtime: 3135.557 ms
(3 rows)


\d version
Table "emol_warehouse_1.version"
  Column   |  Type   | Modifiers
------------+---------+-----------
version_id | integer | not null
parent_id  | integer | not null
... more columns
Indexes:
   "version_pkey" PRIMARY KEY, btree (version_id)





Re: Typecast bug?

From
Tom Lane
Date:
Craig James <craig_james@emolecules.com> writes:
> This seems like a bug to me, but it shows up as a performance problem.

> emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id =
999999999999999999999999999;

If you actually *need* so many 9's here as to force it out of the range
of bigint, then why is your id column not declared numeric?

This seems to me to be about on par with complaining that "intcol = 4.2e1"
won't be indexed.  We have a numeric data type hierarchy, learn to
work with it ...

            regards, tom lane

Re: Typecast bug?

From
Craig James
Date:
Tom Lane wrote:
> Craig James <craig_james@emolecules.com> writes:
>> This seems like a bug to me, but it shows up as a performance problem.
>
>> emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id =
999999999999999999999999999;
>
> If you actually *need* so many 9's here as to force it out of the range
> of bigint, then why is your id column not declared numeric?
>
> This seems to me to be about on par with complaining that "intcol = 4.2e1"
> won't be indexed.  We have a numeric data type hierarchy, learn to
> work with it ...

Your suggestion of "learn to work with it" doesn't fly.  A good design separates the database schema details from the
application to the greatest extent possible.  What you're suggesting is that every application that queries against a
Postgres database should know the exact range of every numeric data type of every indexed column in the schema, simply
because Postgres can't recognize an out-of-range numeric value.

In this case, the optimizer could have instantly returned zero results with no further work, since the query was out of
range for that column.

This seems like a pretty simple optimization to me, and it seems like a helpful suggestion to make to this forum.

BTW, this query came from throwing lots of junk at a web app in an effort to uncover exactly this sort of problem.
It's not a real query, but then, hackers don't use real queries.  The app checks that its input is a well-formed integer
expression, but then assumes Postgres can deal with it from there.

Craig

Re: Typecast bug?

From
"Frank Joerdens"
Date:
On 6/26/08, Craig James <craig_james@emolecules.com> wrote:
> This seems like a bug to me, but it shows up as a performance problem.
> Since the column being queried is an integer, the second query (see below)
> can't possibly match, yet Postgres uses a typecast, forcing a full table
> scan for a value that can't possibly be in the table.

Which version are you using? 8.3 removes a lot of implicit casts (all?
not sure), so this may already be your fix.

Cheers,

Frank

Re: Typecast bug?

From
"Guillaume Smet"
Date:
On Thu, Jun 26, 2008 at 9:02 AM, Frank Joerdens <frank@joerdens.de> wrote:
> Which version are you using? 8.3 removes a lot of implicit casts (all?
> not sure), so this may already be your fix.

8.3 only removed implicit casts from non text types to text (date ->
text, int -> text, interval -> text...) to avoid unexpected
behaviours.

--
Guillaume