Thread: Performance increase with elevator=deadline
This refers to the performance problem reported in
http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php

After some time of trial and error we found that changing the I/O scheduling
algorithm to "deadline" improved I/O performance by a factor of 4 (!) for this
specific load test. It seems that the bottleneck in this case was actually in
the Linux kernel.

Since performance statements are useless without a description of the system
and the type of load, I'll send a few details to make this report more useful
for the archives:

The machine is a PC with 8 AMD Opteron 885 CPUs and 32 GB RAM, attached to a
HP EVA 8100 storage system with 72 disks. We are running 64-bit Linux
2.6.18-53.1.6.el5 from RedHat Enterprise 5.1. The I/O queue depth is set to
64. Our benchmark tools show a possible I/O performance of about 11000
transactions per second for random scattered reads of 8k blocks.

PostgreSQL version is 8.2.4. The database system is a cluster with 6 databases
containing tables up to a couple of GB in size. The whole database cluster
takes about 200 GB of storage.

The database load is a set of read-only statements, several of which have
miserable execution plans and perform table and index scans. With the default
I/O scheduler we observe a performance of about 600 I/O transactions or 7 MB
per second. After booting with elevator=deadline, both values increase by a
factor of up to 4 and the query response times drop drastically.

Yours,
Laurenz Albe
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes: > This refers to the performance problem reported in > http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php > > After some time of trial and error we found that changing the I/O scheduling > algorithm to "deadline" improved I/O performance by a factor 4 (!) for > this specific load test. What was the algorithm before? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Apr 11, 2008, at 7:22 AM, Albe Laurenz wrote:
> After some time of trial and error we found that changing the I/O scheduling
> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
> this specific load test.

I was inspired once again to look into this - as I'm recently hitting some
glass ceilings with my machines. I have a little app I wrote called pgiosim
(it's on pgfoundry - http://pgfoundry.org/projects/pgiosim) that basically
just opens some files, and does random seeks and 8kB reads, much like what our
beloved PG does.

Using 4 of these with a dataset of about 30GB across a few files (machine has
8GB mem) I went from around 100 io/sec to 330 changing to noop. Quite an
improvement. If you have a decent controller, CFQ is not what you want. I
tried deadline as well and it was a touch slower. The controller is a 3ware
9550sx with 4 disks in a raid10.

I'll be trying this out on the big array later today. I found it surprising
this info wasn't more widespread (the use of CFQ on a good controller).

It also seems changing elevators on the fly works fine
(echo schedulername > /sys/block/.../queue/scheduler). I admit I sat there
flipping back and forth going "disk go fast.. disk go slow.. disk go
fast... " :)

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
On Fri, 11 Apr 2008, Jeff wrote: > Using 4 of these with a dataset of about 30GB across a few files (Machine has > 8GB mem) I went from around 100 io/sec to 330 changing to noop. Quite an > improvement. If you have a decent controller CFQ is not what you want. I > tried deadline as well and it was a touch slower. The controller is a 3ware > 9550sx with 4 disks in a raid10. I ran Greg's fadvise test program a while back on a 12-disc array. The three schedulers (deadline, noop, anticipatory) all performed pretty-much the same, with the fourth (cfq, the default) being consistently slower. > it also seems changing elevators on the fly works fine (echo schedulername > > /sys/block/.../queue/scheduler I admit I sat there flipping back and forth > going "disk go fast.. disk go slow.. disk go fast... " :) Oh Homer Simpson, your legacy lives on. Matthew -- I suppose some of you have done a Continuous Maths course. Yes? Continuous Maths? <menacing stares from audience> Whoah, it was like that, was it! -- Computer Science Lecturer
Matthew wrote:
> On Fri, 11 Apr 2008, Jeff wrote:
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop. Quite an improvement. If you have a decent controller CFQ is
>> not what you want. I tried deadline as well and it was a touch
>> slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
>
> I ran Greg's fadvise test program a while back on a 12-disc array. The
> three schedulers (deadline, noop, anticipatory) all performed
> pretty-much the same, with the fourth (cfq, the default) being
> consistently slower.

I use CFQ on some of my servers, despite the fact that it's often slower in
total throughput terms, because it delivers much more predictable I/O
latencies that help prevent light I/O processes being starved by heavy I/O
processes. In particular, a Linux terminal server used at work took a lot of
I/O tuning before it delivered even faintly acceptable I/O latencies under any
sort of load.

Bounded I/O latency at the expense of throughput is not what you usually want
on a DB server, where throughput is king, so I'm not at all surprised that CFQ
performs poorly for PostgreSQL. I've done no testing on that myself, though,
because with my DB size and the nature of my queries most of them are
CPU-bound anyway.

Speaking of I/O performance with PostgreSQL, has anybody here done any testing
to compare results with LVM to results with the same filesystem on a
conventionally partitioned or raw volume? I'd probably use LVM even at a
performance cost because of its admin benefits, but I'd be curious if there is
any known cost for use with Pg. I've never been able to measure one with other
workloads.

--
Craig Ringer
On Sat, 12 Apr 2008, Craig Ringer wrote: > Speaking of I/O performance with PostgreSQL, has anybody here done any > testing to compare results with LVM to results with the same filesystem on a > conventionally partitioned or raw volume? There was some chatter on this topic last year; a quick search finds http://archives.postgresql.org/pgsql-performance/2007-06/msg00005.php which is a fair statement of the situation. I don't recall any specific benchmarks. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
* Jeff: > Using 4 of these with a dataset of about 30GB across a few files > (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to > noop. Quite an improvement. If you have a decent controller CFQ is > not what you want. I tried deadline as well and it was a touch > slower. The controller is a 3ware 9550sx with 4 disks in a raid10. > > I'll be trying this out on the big array later today. I found it > suprising this info wasn't more widespread (the use of CFQ on a good > controller). 3ware might be a bit special because the controller has got very deep queues on its own, so many assumptions of the kernel I/O schedulers do not seem to apply. Toying with the kernel/controller queue depths might help, but I haven't done real benchmarks to see if it's actually a difference. A few days ago, I experienced this: On a machine with a 3ware controller, a simple getxattr call on a file in an uncontended directory took several minutes because a PostgreSQL dump process was running in the background (and some other activity of a legacy database which caused frequent fdatasync calls). Clearly, this is unacceptable, and I've since switched to the deadline scheduler, too. So far, this particular behavior hasn't occurred again. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
On Tue, 15 Apr 2008, Florian Weimer wrote:
> * Jeff:
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop. Quite an improvement. If you have a decent controller CFQ is
>> not what you want. I tried deadline as well and it was a touch
>> slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
>>
>> I'll be trying this out on the big array later today. I found it
>> suprising this info wasn't more widespread (the use of CFQ on a good
>> controller).
>
> 3ware might be a bit special because the controller has got very deep
> queues on its own, so many assumptions of the kernel I/O schedulers do
> not seem to apply. Toying with the kernel/controller queue depths
> might help, but I haven't done real benchmarks to see if it's actually
> a difference.
>
> A few days ago, I experienced this: On a machine with a 3ware
> controller, a simple getxattr call on a file in an uncontended
> directory took several minutes because a PostgreSQL dump process was
> running in the background (and some other activity of a legacy
> database which caused frequent fdatasync calls). Clearly, this is
> unacceptable, and I've since switched to the deadline scheduler, too.
> So far, this particular behavior hasn't occurred again.

One other thing to watch out for: up until very recent kernels (2.6.23 or
2.6.24) it was possible for one very busy block device to starve other block
devices. They added isolation of queues for different block devices, but I've
seen reports that the isolation can end up throttling high-performance devices
as a result. I haven't had time to really dig into this, but there are tuning
knobs available to adjust the queue space available to different devices, and
the reports are of significantly better activity on a tuned system.

David Lang
Gregory Stark wrote: >> After some time of trial and error we found that changing the I/O scheduling >> algorithm to "deadline" improved I/O performance by a factor 4 (!) for >> this specific load test. > > What was the algorithm before? The default algorithm, CFQ I think it is. Yours, Laurenz Albe
Hi,

On 11 Apr 2008, at 20:03, Craig Ringer wrote:
> Speaking of I/O performance with PostgreSQL, has anybody here done
> any testing to compare results with LVM to results with the same
> filesystem on a conventionally partitioned or raw volume? I'd
> probably use LVM even at a performance cost because of its admin
> benefits, but I'd be curious if there is any known cost for use with
> Pg. I've never been able to measure one with other workloads.

I performed some tests some time ago. LVM is significantly slower. The disk
subsystem is a HP P400/512MB battery-backed controller with 4 disks in
raid 10. See the tests:

ext3 tests:

bonnie++ -s 16000 -u 0 -f -b
========================================================================
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           153637  50 78895  17           204124  17 700.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2233  10 +++++ +++  2606   8  2255  10 +++++ +++  2584   7
16000M,,,153637,50,78895,17,,,204124,17,700.6,1,16,2233,10,+++++,+++,2606,8,2255,10,+++++,+++,2584,7

bonnie++ -s 16000 -u 0 -f
========================================================================
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           162223  51 77277  17           207055  17 765.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,162223,51,77277,17,,,207055,17,765.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
========================================================================

LVM tests:

bonnie++ -u 0 -f -s 16000 -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           153768  52 53420  13           177414  15 699.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2158   9 +++++ +++  2490   7  2177   9 +++++ +++  2460   7
16000M,,,153768,52,53420,13,,,177414,15,699.8,1,16,2158,9,+++++,+++,2490,7,2177,9,+++++,+++,2460,7

bonnie++ -u 0 -f -s 16000
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           161476  53 54904  13           171693  14 774.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,161476,53,54904,13,,,171693,14,774.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Bye,
e.
RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
From: Craig James
I had the opportunity to do more testing on another new server to see whether
the kernel's I/O scheduling makes any difference. Conclusion: On a
battery-backed RAID 10 system, the kernel's I/O scheduling algorithm has no
effect. This makes sense, since a battery-backed cache will supersede any I/O
rescheduling that the kernel tries to do.

Hardware:
  Dell 2950
  8 CPU (Intel 2GHz Xeon)
  8 GB memory
  Dell Perc 6i with battery-backed cache
  RAID 10 of 8x 146GB SAS 10K 2.5" disks

Software:
  Linux 2.6.24, 64-bit
  XFS file system
  Postgres 8.3.0
  max_connections = 1000
  shared_buffers = 2000MB
  work_mem = 256MB
  max_fsm_pages = 1000000
  max_fsm_relations = 5000
  synchronous_commit = off
  wal_buffers = 256kB
  checkpoint_segments = 30
  effective_cache_size = 4GB

Each test was run 5 times:
  drop database test
  create database test
  pgbench -i -s 20 -U test
  pgbench -c 10 -t 50000 -v -U test

The I/O scheduler was changed on-the-fly using (for example)
"echo cfq >/sys/block/sda/queue/scheduler". Autovacuum was turned off during
the test.

Here are the results. The numbers are those reported as "tps = xxxx (including
connections establishing)" (which were almost identical to the "excluding..."
tps number).

 I/O Sched      AVG   Test1  Test2  Test3  Test4  Test5
 ---------    -----   -----  -----  -----  -----  -----
 cfq           3355    3646   3207   3132   3204   3584
 noop          3163    2901   3190   3293   3124   3308
 deadline      3547    3923   3722   3351   3484   3254
 anticipatory  3384    3453   3916   2944   3451   3156

As you can see, the averages are very close -- closer than the "noise" between
runs. As far as I can tell, there is no significant advantage, or even any
significant difference, between the various I/O scheduler algorithms.

(It also reinforces what the pgbench man page says: Short runs aren't useful.
Even these two-minute runs have a lot of variability. Before I turned off
AutoVacuum, the variability was more like 50% between runs.)

Craig
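As a sanity check, the AVG column above is just the per-scheduler mean of the five runs; a trivial helper for reducing such a run matrix (the numbers are the ones from the table):

```python
def average_tps(results):
    """Average tps per scheduler, sorted best-first.

    results: scheduler name -> list of per-run tps values.
    """
    avgs = {name: sum(runs) / len(runs) for name, runs in results.items()}
    return sorted(avgs.items(), key=lambda kv: -kv[1])

results = {
    "cfq":          [3646, 3207, 3132, 3204, 3584],
    "noop":         [2901, 3190, 3293, 3124, 3308],
    "deadline":     [3923, 3722, 3351, 3484, 3254],
    "anticipatory": [3453, 3916, 2944, 3451, 3156],
}
```

The spread between schedulers (about 380 tps) is of the same order as the spread between runs of the same scheduler, which is the point being made.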
Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
From: "Scott Marlowe"
On Mon, May 5, 2008 at 5:33 PM, Craig James <craig_james@emolecules.com> wrote: > > (It also reinforces what the pgbench man page says: Short runs aren't > useful. Even these two-minute runs have a lot of variability. Before I > turned off AutoVacuum, the variability was more like 50% between runs.) I'd suggest a couple things for more realistic tests. Run the tests much longer, say 30 minutes to an hour. Crank up your scaling factor until your test db is larger than memory. Turn on autovacuum, maybe raising the cost / delay factors so it doesn't affect performance too negatively. And lastly tuning the bgwriter so that checkpoints are short and don't interfere too much. My guess is if you let it run for a while, you'll get a much more reliable number.
Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
From: Greg Smith
On Mon, 5 May 2008, Craig James wrote: > pgbench -i -s 20 -U test That's way too low to expect you'll see a difference in I/O schedulers. A scale of 20 is giving you a 320MB database, you can fit the whole thing in RAM and almost all of it on your controller cache. What's there to schedule? You're just moving between buffers that are generally large enough to hold most of what they need. > pgbench -c 10 -t 50000 -v -U test This is OK, because when you increase the size you're not going to be pushing 3500 TPS anymore and this test will take quite a while. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
From: Jeff
On May 5, 2008, at 7:33 PM, Craig James wrote: > I had the opportunity to do more testing on another new server to > see whether the kernel's I/O scheduling makes any difference. > Conclusion: On a battery-backed RAID 10 system, the kernel's I/O > scheduling algorithm has no effect. This makes sense, since a > battery-backed cache will supercede any I/O rescheduling that the > kernel tries to do. > this goes against my real world experience here. > pgbench -i -s 20 -U test > pgbench -c 10 -t 50000 -v -U test > You should use a sample size of 2x ram to get a more realistic number, or try out my pgiosim tool on pgfoundry which "sort of" simulates an index scan. I posted numbers from that a month or two ago here. -- Jeff Trout <jeff@jefftrout.com> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/
Greg Smith wrote:
> On Mon, 5 May 2008, Craig James wrote:
>> pgbench -i -s 20 -U test
>
> That's way too low to expect you'll see a difference in I/O schedulers.
> A scale of 20 is giving you a 320MB database, you can fit the whole
> thing in RAM and almost all of it on your controller cache. What's
> there to schedule? You're just moving between buffers that are
> generally large enough to hold most of what they need.

Test repeated with:

  autovacuum enabled
  database destroyed and recreated between runs
  pgbench -i -s 600 ...
  pgbench -c 10 -t 50000 -n ...

 I/O Sched      AVG   Test1  Test2
 ---------    -----   -----  -----
 cfq            705     695    715
 noop           758     769    747
 deadline       741     705    775
 anticipatory   494     477    511

I only did two runs of each, which took about 24 minutes. Like the first round
of tests, the "noise" in the measurements (about 10%) exceeds the difference
between scheduler-algorithm performance, except that "anticipatory" seems to
be measurably slower. So it still looks like cfq, noop and deadline are more
or less equivalent when used with a battery-backed RAID.

Craig
On Tue, 6 May 2008, Craig James wrote:
> I only did two runs of each, which took about 24 minutes. Like the first
> round of tests, the "noise" in the measurements (about 10%) exceeds the
> difference between scheduler-algorithm performance, except that
> "anticipatory" seems to be measurably slower.

Those are much better results. Any test that says anticipatory is anything
other than useless for database system use with a good controller I presume is
broken, so that's how I know you're in the right ballpark now but weren't
before.

In order to actually get some useful data out of the noise that is pgbench,
you need a lot more measurements of longer runs. As perspective, the last time
I did something in this area, in order to get enough data to get a clear
picture I ran tests for 12 hours. I'm hoping to repeat that soon with some
more common hardware that gives useful results I can give out.

> So it still looks like cfq, noop and deadline are more or less equivalent
> when used with a battery-backed RAID.

I think it's fair to say they're within 10% of one another on raw throughput.
The thing you're not measuring here is worst-case latency, and that's where
there might be a more interesting difference. Most tests I've seen suggest
deadline is the best in that regard, cfq the worst, and where noop fits in
depends on the underlying controller.

pgbench produces log files with latency measurements if you pass it "-l".
Here's a snippet of shell that runs pgbench then looks at the resulting
latency results for the worst 5 numbers:

pgbench ... -l &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
echo Worst latency results:
cat pgbench.log | cut -f 3 -d " " | sort -n | tail -n 5

However, that may not give you much useful info either--in most cases
checkpoint issues kind of swamp the worst-case behavior in PostgreSQL, and to
quantify I/O schedulers you need to look at more complicated statistics on
latency.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
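Greg's shell pipeline can be pushed a bit further in any scripting language to get a percentile rather than just the top five. A sketch that makes the same assumption as the `cut -f 3` above, namely that the per-transaction latency is the third whitespace-separated field of each pgbench log line:

```python
def latency_stats(log_lines, worst_n=5, field=2):
    """Worst-n latencies and a rough 95th percentile from pgbench -l output.

    field=2 matches the `cut -f 3` convention: latency is the third
    whitespace-separated column; non-numeric lines are skipped.
    """
    lats = []
    for line in log_lines:
        parts = line.split()
        if len(parts) > field:
            try:
                lats.append(int(parts[field]))
            except ValueError:
                pass
    lats.sort()
    p95 = lats[min(len(lats) - 1, int(0.95 * len(lats)))] if lats else None
    return lats[-worst_n:], p95
```

Looking at a high percentile instead of the absolute worst value helps keep one checkpoint stall from dominating the comparison, which is the caveat Greg raises.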
Greg Smith wrote:
> On Tue, 6 May 2008, Craig James wrote:
>> I only did two runs of each, which took about 24 minutes. Like the
>> first round of tests, the "noise" in the measurements (about 10%)
>> exceeds the difference between scheduler-algorithm performance, except
>> that "anticipatory" seems to be measurably slower.
>
> Those are much better results. Any test that says anticipatory is
> anything other than useless for database system use with a good
> controller I presume is broken, so that's how I know you're in the right
> ballpark now but weren't before.
>
> In order to actually get some useful data out of the noise that is
> pgbench, you need a lot more measurements of longer runs. As
> perspective, the last time I did something in this area, in order to get
> enough data to get a clear picture I ran tests for 12 hours. I'm hoping
> to repeat that soon with some more common hardware that gives useful
> results I can give out.

This data is good enough for what I'm doing. There were reports from non-RAID
users that the I/O scheduling could make as much as a 4x difference in
performance (which makes sense for non-RAID), but these tests show me that
three of the four I/O schedulers are within 10% of each other. Since this
matches my intuition of how battery-backed RAID will work, I'm satisfied. If
our servers get overloaded to the point where 10% matters, then I need a much
more dramatic solution, like faster machines or more machines.

Craig
Craig James wrote:
> This data is good enough for what I'm doing. There were
> reports from non-RAID users that the I/O scheduling could
> make as much as a 4x difference in performance (which makes
> sense for non-RAID), but these tests show me that three of
> the four I/O schedulers are within 10% of each other. Since
> this matches my intuition of how battery-backed RAID will
> work, I'm satisfied. If our servers get overloaded to the
> point where 10% matters, then I need a much more dramatic
> solution, like faster machines or more machines.

I should comment on this as I am the one who reported the big performance
increase with the deadline scheduler.

I was very surprised at this increase myself as I had not seen any similar
reports, so I thought I should share it for whatever it is worth. Our SAN *is*
a RAID-5 with lots of cache, so there must be a flaw in your intuition.

Performance measures depend a lot on your hardware and software setup (e.g.
kernel version in this case) and on the specific load. The load we used was a
real-life load, collected over several hours and extracted from the log files.
My opinion is that performance observations can rarely be generalized - I am
not surprised that with a different system and a different load you observe
hardly any difference between "cfq" and "deadline". For the record, in our
test case "noop" performed practically as well as "deadline", while the other
two did way worse.

Like yourself, I have wondered why different I/O scheduling algorithms should
make so much difference. Here is my home-spun theory of what may happen; tear
it apart and replace it with a better one at your convenience: Our SAN
probably (we're investigating) has its own brains to optimize I/O, and I guess
that any optimization that the kernel does can only deteriorate performance
because the two algorithms might "step on each other's toes". This is backed
by "noop" performing well.
I believe that caching will not make much difference, because the cache is way smaller than the database, and whatever is neither in the shared buffer nor in the kernel filesystem cache is also not likely to be in the storage system's cache. Remember that our load was read-only. Yours, Laurenz Albe
On Tue, 6 May 2008, Craig James wrote:
>  I/O Sched      AVG   Test1  Test2
>  ---------    -----   -----  -----
>  cfq            705     695    715
>  noop           758     769    747
>  deadline       741     705    775
>  anticipatory   494     477    511

Interesting. That contrasts with some tests I did a while back on a 16-disc
RAID-0, where noop, deadline, and anticipatory were all identical in
performance, with cfq being significantly slower. Admittedly, the disc test
was single-process, which is probably why the anticipatory behaviour didn't
kick in. You are seeing a little bit of degradation with cfq - I guess it's
worse the bigger the disc subsystem you have.

Matthew

--
Matthew: That's one of things about Cambridge - all the roads keep changing
         names as you walk along them, like Hills Road in particular.
Sagar:   Yes, Sidney Street is a bit like that too.
Matthew: Sidney Street *is* Hills Road.
This seems like a bug to me, but it shows up as a performance problem. Since
the column being queried is an integer, the second query (see below) can't
possibly match, yet Postgres uses a typecast, forcing a full table scan for a
value that can't possibly be in the table.

The application could intercept these bogus queries, but that requires
building schema-specific and postgres-specific knowledge into the application
(i.e. "What is the maximum legal integer for this column?").

Craig

explain analyze select version_id, parent_id from version where version_id = 99999;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using version_pkey on version  (cost=0.00..9.89 rows=1 width=8) (actual time=0.054..0.054 rows=0 loops=1)
   Index Cond: (version_id = 99999)
 Total runtime: 0.130 ms
(3 rows)

emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on version  (cost=0.00..253431.77 rows=48393 width=8) (actual time=3135.530..3135.530 rows=0 loops=1)
   Filter: ((version_id)::numeric = 999999999999999999999999999::numeric)
 Total runtime: 3135.557 ms
(3 rows)

\d version
         Table "emol_warehouse_1.version"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer | not null
 parent_id  | integer | not null
 ... more columns
Indexes:
    "version_pkey" PRIMARY KEY, btree (version_id)
Craig James <craig_james@emolecules.com> writes: > This seems like a bug to me, but it shows up as a performance problem. > emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999; If you actually *need* so many 9's here as to force it out of the range of bigint, then why is your id column not declared numeric? This seems to me to be about on par with complaining that "intcol = 4.2e1" won't be indexed. We have a numeric data type hierarchy, learn to work with it ... regards, tom lane
Tom Lane wrote:
> Craig James <craig_james@emolecules.com> writes:
>> This seems like a bug to me, but it shows up as a performance problem.
>
>> emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999;
>
> If you actually *need* so many 9's here as to force it out of the range
> of bigint, then why is your id column not declared numeric?
>
> This seems to me to be about on par with complaining that "intcol = 4.2e1"
> won't be indexed. We have a numeric data type hierarchy, learn to
> work with it ...

Your suggestion of "learn to work with it" doesn't fly. A good design
separates the database schema details from the application to the greatest
extent possible. What you're suggesting is that every application that queries
against a Postgres database should know the exact range of every numeric data
type of every indexed column in the schema, simply because Postgres can't
recognize an out-of-range numeric value.

In this case, the optimizer could have instantly returned zero results with no
further work, since the query was out of range for that column. This seems
like a pretty simple optimization to me, and it seems like a helpful
suggestion to make to this forum.

BTW, this query came from throwing lots of junk at a web app in an effort to
uncover exactly this sort of problem. It's not a real query, but then, hackers
don't use real queries. The app checks that its input is a well-formed integer
expression, but then assumes Postgres can deal with it from there.

Craig
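For what it's worth, the application-side guard being debated is tiny. A sketch of the range check, using PostgreSQL's documented int4 bounds, that lets an application skip an impossible query without a database round-trip:

```python
# PostgreSQL "integer" (int4) is a signed 32-bit value.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1

def can_match_integer_column(value):
    """True if value could possibly equal a Postgres integer (int4)
    column; out-of-range values are guaranteed to match zero rows,
    so the query can be short-circuited application-side."""
    return INT4_MIN <= value <= INT4_MAX

# The web app's handler would then do something like:
#   if not can_match_integer_column(version_id):
#       return []   # skip the query entirely
```

This is exactly the schema-specific knowledge Craig objects to having to embed, so it is a workaround rather than a fix for the planner behavior.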
On 6/26/08, Craig James <craig_james@emolecules.com> wrote: > This seems like a bug to me, but it shows up as a performance problem. > Since the column being queried is an integer, the second query (see below) > can't possibly match, yet Postgres uses a typecast, forcing a full table > scan for a value that can't possibly be in the table. Which version are you using? 8.3 removes a lot of implicit casts (all? not sure), so this may already be your fix. Cheers, Frank
On Thu, Jun 26, 2008 at 9:02 AM, Frank Joerdens <frank@joerdens.de> wrote: > Which version are you using? 8.3 removes a lot of implicit casts (all? > not sure), so this may already be your fix. 8.3 only removed implicit casts from non text types to text (date -> text, int -> text, interval -> text...) to avoid unexpected behaviours. -- Guillaume