Thread: Postgres 7.3.1 poor insert/update/search performance esp WRT Oracle
I'm trying to convert from Oracle to Postgres (Postgres is billions of times more straightforward and pragmatically clean than Oracle), but I'm having some severe performance problems on what naively appears to be a very straightforward, dead-simple test. The test consists of two parts: a write part which accumulates (sums) numbers by distinct key, and a read part which searches for keys in the database (some of which will be present, some of which will not). In a more realistic scenario both would be happening all the time, but we can start off easy. However, performance is terrible: around 39 write transactions/second and 69 searches/second. Oracle, by comparison, writes at 314 and reads at 395 -- practically an order of magnitude better. Both are using the same hardware (obviously not at the same time): a dual-processor AMD 2000+ with 3GB memory, with both Oracle and Postgres on a 105GB ``MD'' striped (no redundancy) pair of SCSI disks running ext3 (no special flags) under Linux 2.4.18-10smp.

I actually have seven different schemes for performing the writes using Postgres:

----------------------------------------------------------------------
"normal" C libpq                        39 t/s
"normal" Perl DBI                       39 t/s
"DBI Prepared Statement" Perl DBI       39 t/s
"Batching" Perl DBI                     45 t/s
"arrays" Perl DBI                       26 t/s
"server-side function" Perl DBI         39 t/s
"server-side trigger" Perl DBI          39 t/s

"normal" Perl DBI read                  69 t/s

"normal" Perl DBI for Oracle           314 t/s
"normal" Perl DBI read for Oracle      395 t/s
----------------------------------------------------------------------

Only batching showed a statistically significant improvement, and it wasn't that major. I couldn't use true Postgres prepared statements since you cannot yet determine the success/failure of the statements. I was planning on using arrays as well, but the additional 33% performance hit is not amusing (though I suppose it is only an additional 3% if you consider the 87% performance drop from Oracle to Postgres). I'll include all the methods in the attached file, but since there was no significant difference, I'll concentrate on the basic one.

Example table:

----------------------------------------------------------------------
CREATE TABLE test (
    val   BIGINT PRIMARY KEY,  -- "vals" may be between 0 and 2^32-1
    accum INTEGER
);
----------------------------------------------------------------------

Basic algorithm for writes:

----------------------------------------------------------------------
while (<>) {
    chomp;
    @A = split;
    # Try the update first; DBI returns "0E0" (true but zero rows)
    # when no row matched, in which case the key is new and we insert.
    if (dosql($dbh, "UPDATE test SET accum = accum + $A[1] WHERE val = '$A[0]';", 0) eq "0E0") {
        dosql($dbh, "INSERT INTO test VALUES ( $A[0], $A[1] );");
    }
}
----------------------------------------------------------------------

Basic algorithm for reads:

----------------------------------------------------------------------
while (<>) {
    chomp;
    @A = split;
    $sth = querysql($dbh, "SELECT accum FROM test WHERE val = $A[0];");
    $hit++ if ($sth && ($row = $sth->fetchrow_arrayref));
    $tot++;
}
----------------------------------------------------------------------

What could be simpler? In my randomly generated write data, I usually end up with about 18K inserts and 82K updates. In my randomly generated read data, I have 100K keys which will be found and 100K keys which will not be found.
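The dosql and querysql helpers above are defined in the attached tarball rather than shown here. Purely so the loops read as runnable code, here is a minimal sketch of what they presumably look like -- the names come from the post, but the bodies are an assumption, not the attachment's actual code:

----------------------------------------------------------------------
# Hypothetical reconstruction of the helper subs.  dosql() runs a
# statement and returns DBI's affected-row count ("0E0" means zero
# rows matched: true, but numerically zero); querysql() returns an
# executed statement handle for the caller to fetch from.
use DBI;

sub dosql {
    my ($dbh, $sql, $fatal) = @_;
    my $rows = $dbh->do($sql);
    die $dbh->errstr if !defined($rows) && $fatal;
    return defined($rows) ? $rows : 0;
}

sub querysql {
    my ($dbh, $sql) = @_;
    my $sth = $dbh->prepare($sql) or return undef;
    return $sth->execute ? $sth : undef;
}
----------------------------------------------------------------------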
The postgresql.conf file is default (my sysadmin nuked all of my changes when he upgraded to 7.3.1 -- grr), and there are some shared memory configs:

kernel.sem = 250 32000 100 128
kernel.shmmax = 2147483648
kernel.shmmni = 100
kernel.shmmax = 134217728

The WAL is not separated onto its own disk (but see below). A "vacuum analyze" is performed between the write phase and the read phase. However, for your analysis pleasure, here are the results of a full verbose analyze and some explain results (both before and after).

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

seth=> explain update test set accum = accum + 53 where val = '5';
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on test  (cost=0.00..0.00 rows=1 width=18)
   Filter: (val = 5::bigint)
(2 rows)

seth=> explain insert into test values (5, 53);
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

seth=> vacuum full verbose analyze test;
INFO:  --Relation public.test--
INFO:  Pages 541: Changed 2, reaped 540, Empty 0, New 0; Tup 18153: Vac 81847, Keep/VTL 0/0, UnUsed 0, MinLen 40, MaxLen 40; Re-using: Free/Avail. Space 3294932/3294932; EndEmpty/Avail. Pages 0/541. CPU 0.00s/0.03u sec elapsed 0.02 sec.
INFO:  Index test_pkey: Pages 355; Tuples 18153: Deleted 81847. CPU 0.03s/0.34u sec elapsed 0.65 sec.
INFO:  Rel test: Pages: 541 --> 99; Tuple(s) moved: 18123. CPU 1.01s/0.31u sec elapsed 9.65 sec.
INFO:  Index test_pkey: Pages 355; Tuples 18153: Deleted 18123. CPU 0.02s/0.06u sec elapsed 0.19 sec.
INFO:  Analyzing public.test
VACUUM

seth=> explain select accum from test where val = 5;
                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..5.99 rows=1 width=4)
   Index Cond: (val = 5)
(2 rows)

seth=> explain update test set accum = accum + 53 where val = '5';
                               QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..5.99 rows=1 width=18)
   Index Cond: (val = 5::bigint)
(2 rows)

seth=> explain insert into test values (5, 53);
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

I certainly understand that using an index scan might well speed up the updates, but considering that the search performance above is post-analyze (pre-analyze it is even more deadly slow), I am dubious that getting index scans during the updates will bring me within striking distance of Oracle, since read performance has got to be better than write performance, right? This is also why I am dubious that moving the WAL to another filesystem or futzing with the fsync policy will do anything. I will include below a compressed tarball of the programs I used (and the corresponding RUNME script) in case you wish to play along at home. I don't claim they are pretty, BTW :-)

-Seth Robertson
On Tue, 21 Jan 2003, Seth Robertson wrote:

> The postgresql.conf file is default (my sysadmin nuked all of my
> changes when he upgraded to 7.3.1--grr) and there are some shared
> memory configs: kernel.sem = 250 32000 100 128, kernel.shmmax =
> 2147483648, kernel.shmmni = 100, kernel.shmmax = 134217728. The
> WAL is not separated (but see below).

You almost certainly want to raise shared_buffers from the default (64?) to say 1k-10k. I'm not sure how much that'll help but it should help some.

> A "vacuum analyze" is performed between the write phase and the read
> phase. However, for your analysis pleasure, here are the results
> of a full verbose analyze and some explain results (both before and after).

BTW: what does explain analyze (rather than plain explain) show?
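For anyone following along at home: shared_buffers lives in postgresql.conf and is measured in 8 kB buffers, so the suggested range is modest in absolute terms. A minimal excerpt with an illustrative value (the postmaster must be restarted for the change to take effect, and kernel.shmmax must cover the resulting segment):

----------------------------------------------------------------------
# postgresql.conf -- shared_buffers is counted in 8 kB pages,
# so 4096 buffers = 32 MB of shared memory.
shared_buffers = 4096       # 7.3 default is 64, i.e. only 512 kB
----------------------------------------------------------------------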
In message <20030121134242.Q84028-100000@megazone23.bigpanda.com>, Stephan Szabo writes:

> You almost certainly want to raise shared_buffers from the default
> (64?) to say 1k-10k. I'm not sure how much that'll help but it
> should help some.

I'll try that and report back later, but I was under the (false?) impression that it was primarily important when you had multiple database connections using the same table.

> BTW: what does explain analyze (rather than plain explain) show?

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

seth=> explain analyze select accum from test where val = 5;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..0.00 rows=1 width=4) (actual time=94.55..94.55 rows=0 loops=1)
   Filter: (val = 5)
 Total runtime: 99.20 msec
(3 rows)

seth=> explain analyze update test set accum = accum + 53 where val = '5';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..0.00 rows=1 width=18) (actual time=31.95..31.95 rows=0 loops=1)
   Filter: (val = 5::bigint)
 Total runtime: 32.04 msec
(3 rows)

seth=> explain analyze insert into test values (5, 53);
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
 Total runtime: 7.50 msec
(2 rows)

seth=> vacuum full verbose analyze test;
INFO:  --Relation public.test--
INFO:  Pages 541: Changed 1, reaped 539, Empty 0, New 0; Tup 18071: Vac 81930, Keep/VTL 0/0, UnUsed 0, MinLen 40, MaxLen 40; Re-using: Free/Avail. Space 3298208/3298176; EndEmpty/Avail. Pages 0/540. CPU 0.03s/0.00u sec elapsed 0.02 sec.
INFO:  Index test_pkey: Pages 355; Tuples 18071: Deleted 81930. CPU 0.04s/0.41u sec elapsed 1.96 sec.
INFO:  Rel test: Pages: 541 --> 98; Tuple(s) moved: 18046. CPU 0.95s/0.42u sec elapsed 12.74 sec.
INFO:  Index test_pkey: Pages 355; Tuples 18071: Deleted 18046. CPU 0.02s/0.05u sec elapsed 0.31 sec.
INFO:  Analyzing public.test
VACUUM

seth=> explain analyze select accum from test where val = 5;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..323.89 rows=1 width=4) (actual time=0.13..14.20 rows=1 loops=1)
   Filter: (val = 5)
 Total runtime: 14.26 msec
(3 rows)

seth=> explain analyze select accum from test where val = 2147483648;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..5.99 rows=1 width=4) (actual time=0.11..0.11 rows=0 loops=1)
   Index Cond: (val = 2147483648::bigint)
 Total runtime: 0.16 msec
(3 rows)

seth=> explain analyze update test set accum = accum + 53 where val = '5';
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..5.99 rows=1 width=18) (actual time=0.24..0.24 rows=1 loops=1)
   Index Cond: (val = 5::bigint)
 Total runtime: 0.39 msec
(3 rows)

seth=> explain analyze insert into test values (6, 53);
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
 Total runtime: 0.08 msec
(2 rows)

seth=> explain analyze insert into test values (2147483647, 53);
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=1)
 Total runtime: 0.33 msec
(2 rows)

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

-Seth Robertson
Seth Robertson <pgsql-performance@sysd.com> writes:

> I'll try that and report back later, but I was under the (false?)
> impression that it was primarily important when you had multiple
> database connections using the same table.

Definitely false. shared_buffers needs to be 1000 or so for production-grade performance. There are varying schools of thought about whether it's useful to raise it even higher, but in any case 64 is just a toy-installation setting.

> seth=> explain analyze select accum from test where val = 5;
>  Seq Scan on test  (cost=0.00..323.89 rows=1 width=4) (actual time=0.13..14.20 rows=1 loops=1)
>    Filter: (val = 5)
>  Total runtime: 14.26 msec

> seth=> explain analyze update test set accum = accum + 53 where val = '5';
>  Index Scan using test_pkey on test  (cost=0.00..5.99 rows=1 width=18) (actual time=0.24..0.24 rows=1 loops=1)
>    Index Cond: (val = 5::bigint)
>  Total runtime: 0.39 msec

The quotes are important when you are dealing with BIGINT indexes. You won't get an indexscan if the constant looks like int4 rather than int8.

			regards, tom lane
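To make the pitfall concrete: an unquoted numeric literal is typed int4, and the 7.3 planner will not use an int8 index for an int4 comparison, while a quoted (untyped) literal or an explicit cast resolves to int8 and allows the index scan. An illustrative session, consistent with the plans quoted above:

----------------------------------------------------------------------
seth=> explain select accum from test where val = 5;          -- int4 literal: seq scan
seth=> explain select accum from test where val = '5';        -- untyped literal, resolved as int8: index scan
seth=> explain select accum from test where val = 5::bigint;  -- explicit cast: index scan
----------------------------------------------------------------------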
Tom and others:

There has been a lot of talk about shared memory size recently, along with many conflicting statements from various people. Earlier threads said that setting the shared buffer to a high value (like 512MB on a 2GB dedicated DB server) is not a good idea. A couple of reasons were mentioned:

a) potential inefficiencies with the kernel and VM system
b) modern kernels' aggressive caching with all free memory
c) the shared memory stealing from memory the kernel would use to cache, etc.

So my question is: if the kernel is caching all this data, what's the benefit of setting this to 1000 or higher? Why wouldn't I just set it to 0 if I believe my kernel is doing a good job?

From all the discussion on this topic, it's still not clear to me how to calculate what value this should be set at and why. I've read these documents and others and have yet to find explanations and recommendations that I can use:

http://www.postgresql.org/docs/momjian/hw_performance.pdf
http://www.postgresql.org/idocs/index.php?runtime-config.html
http://www.postgresql.org/idocs/index.php?kernel-resources.html
http://www.postgresql.org/idocs/index.php?performance-tips.html
http://www.ca.postgresql.org/docs/momjian/hw_performance/node6.html
http://www.ca.postgresql.org/docs/momjian/hw_performance/node5.html
http://www.ca.postgresql.org/docs/faq-english.html#3.6

This is such a common topic, it would be nice to see a more definitive and comprehensive section in the docs on tuning. Google searches for "shared_buffers site:www.postgresql.org" and "tuning site:www.postgresql.org" come up with little info.

FYI: I've been running our database, which is mostly read-only, with 1500 buffers. On the whole, we see very little IO. postgresql performs many many million queries a day, many simple, many complex, though the database is relatively small, around 3GB.

--brian

On Tuesday, January 21, 2003, at 03:31 PM, Tom Lane wrote:

> Definitely false. shared_buffers needs to be 1000 or so for
> production-grade performance. There are varying schools of thought
> about whether it's useful to raise it even higher, but in any case
> 64 is just a toy-installation setting.
In message <13165.1043188295@sss.pgh.pa.us>, Tom Lane writes:

> Definitely false. shared_buffers needs to be 1000 or so for
> production-grade performance. There are varying schools of thought
> about whether it's useful to raise it even higher, but in any case
> 64 is just a toy-installation setting.

Increasing the setting to 4096 improved write performance by 20%. Increasing the setting to 8192 had no additional effect. I could try a few more probes if anyone cares.

> The quotes are important when you are dealing with BIGINT indexes.
> You won't get an indexscan if the constant looks like int4 rather
> than int8.

You are not kidding!!!! Changing this increased the search performance to 2083 transactions/second. This is 30 times faster than before, and 5 times faster than Oracle! Go Tom Lane!!!

Unfortunately, the update already used the quoting by accident, so this tip did not directly help the write case. However, it did inspire me to check some other suggestions I have read, since obviously there was performance to be had:

----------------------------------------------------------------------
Oracle read performance:                                     395
Original read performance:                                    69
shared_buffers = 4096:                                       118
 + quoted where (WHERE val = '5'):                          2083
----------------------------------------------------------------------

----------------------------------------------------------------------
Oracle write performance:                                    314
Original write performance:                                   39
shared_buffers = 4096:                                        47
 + Occasional (@ 10K & 60K vectors) vacuum analyze in bg:    121
 + Periodic (every 10K vectors) vacuum analyze in bg:        124
 + wal_buffers = 24:                                         125
 + wal_sync_method = fdatasync:                              127
 + wal_sync_method = open_sync:                              248
 + wal_sync_method = open_datasync:                Not Supported
 + fsync = false:                                            793
----------------------------------------------------------------------

Just to round out my report, using the fastest safe combination I was able to find (open_sync *is* safe, isn't it?), I reran all 7 performance tests to see if there was any difference using the different access methods:

----------------------------------------------------------------------
"normal" C libpq                       256 t/s
"normal" Perl DBI                      251 t/s
"DBI Prepared Statement" Perl DBI      254 t/s
"Batching" Perl DBI                   1149 t/s
"arrays" Perl DBI                       43 t/s
"server-side function" Perl DBI         84 t/s
"server-side trigger" Perl DBI          84 t/s

"normal" Perl DBI read                1960 t/s

"normal" Perl DBI for Oracle           314 t/s
"normal" Perl DBI read for Oracle      395 t/s
----------------------------------------------------------------------

With a batching update rate of 1149 transactions per second (a 2900% improvement), I am willing to call it a day unless anyone else has any brilliant ideas. However, it looks like my hope to use arrays is doomed; I'm not sure I can handle the performance penalty.

-Seth Robertson
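The "Batching" variant lives only in the attachment, but the usual trick is simply to turn off AutoCommit and commit every N vectors, so that many statements share a single WAL flush at COMMIT instead of paying one fsync per statement. A sketch under that assumption (the database name and batch size of 100 are invented, not taken from the attachment):

----------------------------------------------------------------------
# Sketch: amortize the per-COMMIT fsync across a batch of updates.
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=seth", "", "",
                       { AutoCommit => 0, RaiseError => 1 });
my $batch = 0;
while (<>) {
    chomp;
    my ($val, $amt) = split;
    # Same upsert logic as the basic algorithm, quoting kept for int8.
    if ($dbh->do("UPDATE test SET accum = accum + $amt WHERE val = '$val'") eq "0E0") {
        $dbh->do("INSERT INTO test VALUES ($val, $amt)");
    }
    $dbh->commit if ++$batch % 100 == 0;   # assumed batch size
}
$dbh->commit;
----------------------------------------------------------------------

That shape is consistent with the jump from ~250 to 1149 t/s: the dominant per-transaction cost (the WAL flush) is paid once per hundred vectors rather than once per vector.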
On Tue, Jan 21, 2003 at 06:44:57PM -0700, Brian Hirt wrote:

> So my question is: if the kernel is caching all this data, what's the
> benefit of setting this to 1000 or higher? Why wouldn't i just set it
> to 0 if I believe my kernel is doing a good job.

If Postgres tries to fetch a bit of data which is in its own shared buffer, it does not even need to make a system call in order to fetch it. The data fetch is extremely fast. The problem is that managing that shared memory comes at some cost.

If the data is not in a shared buffer, then Postgres makes exactly the same call, no matter what, to the OS kernel, asking for the data from disk. It might happen, however, that the kernel has the data in its disk cache. The total cost of the operation is therefore much lower when the data is in the kernel's disk cache than when it is actually on the disk; but it is nevertheless still higher than fetching the data from Postgres's own shared buffer.

So the question is this: where is the "sweet spot" at which it costs little enough for Postgres to manage the shared buffer that the reduced cost of a system call is worth it? (As you point out, this calculation is complicated by the potential to waste memory by caching the data twice -- once in the shared buffer and once in the disk cache. Some systems, like Solaris, allow you to turn off the disk cache, so the problem may not be one you face.)

The trouble is that there is no agreement on the answer to that question, and precious little evidence which seems to settle it. The only way to determine the best setting is to use your system under more-or-less simulated production loads and see where the best setting lies. You have to do this over time, because sometimes inefficiencies turn up only after running for a while. In an experiment we tried, we used a 2G shared buffer on a 12G machine. It looked brilliantly fast at first, but 48 hours later was _crawling_; that indicates a problem with shared-buffer management on the part of Postgres, I guess, but it was hard to say more than that. We ultimately settled on a value somewhere less than 1G as appropriate for our use. But if I had to justify the number I picked (as opposed to one a few hundred higher or lower), I'd have a tough time.

> From all the discussion on this topic, it's still not clear to me how
> to calculate what value this should be set at and why. I've read these
> documents and others and have yet to find explanations and
> recommendations that i can use.

I'm afraid what I'm saying is that it's a bit of a black art. The pg_autotune project is an attempt to make this a little more scientific. It relies on pgbench, which has its own problems, however.

Hope that's helpful, but I fear it doesn't give you the answer you'd like.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
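For reference, the pgbench-based probing Andrew describes boils down to runs like the following (pgbench ships in contrib/; the scale factor, client count, and database name are arbitrary examples, and, as he warns, short runs can hide problems that only appear after hours of load):

----------------------------------------------------------------------
$ pgbench -i -s 10 bench        # initialize a test database at scale factor 10
$ pgbench -c 10 -t 1000 bench   # 10 concurrent clients, 1000 transactions each
----------------------------------------------------------------------

Repeat the second command after each edit to shared_buffers (and a postmaster restart) and compare the reported tps across settings.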
Brian Hirt <bhirt@mobygames.com> writes:

> So my question is: if the kernel is caching all this data, what's the
> benefit of setting this to 1000 or higher? Why wouldn't i just set it
> to 0 if I believe my kernel is doing a good job.

Well, setting it to 0 won't work ;-). There's some minimum number of buffers needed for PG to work at all; depending on complexity of your queries and number of active backends it's probably around 25-100. (You'll get "out of buffers" failures if too few.)

But more to the point, when shared_buffers is too small you'll waste CPU cycles on unnecessary data transfers between kernel and user memory. It seems to be pretty well established that 64 is too small for most applications. I'm not sure how much is enough, but I suspect that a few thousand is plenty to get past the knee of the performance curve in most scenarios.

			regards, tom lane
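Since each buffer is an 8 kB page, the numbers being tossed around translate to quite small amounts of memory -- worth keeping in mind when weighing them against a multi-gigabyte OS disk cache:

----------------------------------------------------------------------
  64 buffers x 8 kB =  512 kB   (the 7.3 default)
1000 buffers x 8 kB = ~  8 MB
4096 buffers x 8 kB =   32 MB
8192 buffers x 8 kB =   64 MB
----------------------------------------------------------------------

Even Seth's 8192-buffer test pins only about 2% of his 3 GB of RAM.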
Seth Robertson writes:

> However, it looks like my hope to use arrays is doomed; I'm not
> sure I can handle the performance penalty.

Just in case I get the person who implemented arrays annoyed or worried: I had not properly modified the "array" test, and was vacuuming the wrong table every 10000 vectors during the run. I realized that this morning, and the new array results are listed below. I also experimented with batching read operations, and I was surprised to find that this helps a great deal as well.

----------------------------------------------------------------------
"normal" C libpq                       256 t/s
"normal" Perl DBI                      251 t/s
"DBI Prepared Statement" Perl DBI      254 t/s
"Batching" Perl DBI                   1149 t/s
"arrays" Perl DBI                      250 t/s (*)
"arrays with batching" Perl DBI       1020 t/s (*)
"server-side function" Perl DBI         84 t/s
"server-side trigger" Perl DBI          84 t/s

"normal" Perl DBI read                1960 t/s
"batched" Perl DBI read               3076 t/s (*)
"array" Perl DBI read                 1754 t/s (*)
"batched array" Perl DBI read         2702 t/s (*)

"normal" Perl DBI for Oracle           314 t/s
"normal" Perl DBI read for Oracle      395 t/s
----------------------------------------------------------------------
(*) New/updated from this morning

This brings the array code to within 11% of the performance of batched non-arrays -- close enough to be an option. I may well be doing something wrong with the server-side functions, but I don't see anything quite so obviously wrong.

-Seth Robertson
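As with the write-side batching, the batched-read code is only in the attachment; a common way to do it is to probe many keys per statement with an IN list. A sketch under that assumption (the batch size of 100 is invented, and $dbh, $hit, and $tot are the variables from the earlier scripts):

----------------------------------------------------------------------
# Sketch: look up keys 100 at a time instead of one SELECT per key.
# Quoting each literal keeps the int8 index usable, per Tom's tip.
my @chunk;
while (<>) {
    chomp;
    my ($val) = split;
    push @chunk, $val;
    next unless @chunk == 100;
    my $in  = join ",", map { "'$_'" } @chunk;
    my $sth = $dbh->prepare("SELECT val, accum FROM test WHERE val IN ($in)");
    $sth->execute;
    $hit++ while $sth->fetchrow_arrayref;   # rows come back only for found keys
    $tot += @chunk;
    @chunk = ();
}
# (A real run would flush any partial final chunk the same way.)
----------------------------------------------------------------------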
On Wed, 2003-01-22 at 07:05, Andrew Sullivan wrote:

> (As you point out, this calculation is complicated by the potential to
> waste memory by caching the data twice

If we had a good buffer replacement algorithm (which we currently do not), ISTM that hot pages retained in PostgreSQL's buffer cache would never get loaded from the OS's IO cache, thus causing those pages to eventually be evicted from the OS's cache. So the "cache the data twice" problem doesn't apply in all circumstances.

> Some systems, like Solaris, allow you to turn off the
> disk cache, so the problem may not be one you face.)

I think it would be interesting to investigate disabling the OS's cache for all relation I/O (i.e. heap files, index files). That way we could both improve performance (by moving all the caching into PostgreSQL's domain, where there is more room for optimization) and make configuration simpler: in an ideal world, it would remove the need to consider the OS's caching when configuring the amount of shared memory to allocate to PostgreSQL.

Can this be done using O_DIRECT? If so, is it portable?

BTW, if anyone has any information on actually *using* O_DIRECT, I'd be interested in it. I tried to quickly hack PostgreSQL to use it, without success...

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Fri, 30 Jan 2003, Neil Conway wrote:

> If we had a good buffer replacement algorithm (which we currently do
> not), ISTM that hot pages retained in PostgreSQL's buffer cache would
> never get loaded from the OS's IO cache, thus causing those pages to
> eventually be evicted from the OS's cache. So the "cache the data twice"
> problem doesn't apply in all circumstances.

No, but it does apply to every block at some point, since during the initial load a block is present in both caches, and it has to be flushed from the OS's cache at some point.

> I think it would be interesting to investigate disabling the OS' cache
> for all relation I/O (i.e. heap files, index files). That way we could
> both improve performance (by moving all the caching into PostgreSQL's
> domain, where there is more room for optimization)...

I'm not so sure that there is all that much more room for optimization. But take a look at what Solaris and FFS do now, consider how much work it would be to rewrite it, and then see if you even want to do that before adding stuff to improve performance.

> ... as well as make configuration simpler: in an ideal world, it would
> remove the need to consider the OS' caching when configuring the
> amount of shared memory to allocate to PostgreSQL.

We could do that much more simply by using mmap.

> Can this be done using O_DIRECT?

It can, but you're going to lose some of the advantages that you'd get from using raw devices instead. In particular, you have no way to know the physical location of blocks on the disk, because those locations are often different from the locations in the file.

> If so, is it portable?

O_DIRECT is not all that portable, I don't think. Certainly not as portable as mmap.

cjs
-- 
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote:

> > I think it would be interesting to investigate disabling the OS' cache
> > for all relation I/O (i.e. heap files, index files)...
>
> I'm not so sure that there is all that much more room for optimization.
> But take a look at what Solaris and FFS do now, consider how much
> work it would be to rewrite it, and then see if you even want to do
> that before adding stuff to improve performance.

We need free-behind for large sequential scans, like Solaris has. Do we have LRU-2 or LRU-K now?

> > If so, is it portable?
>
> O_DIRECT is not all that portable, I don't think. Certainly not as
> portable as mmap.

As I remember, DIRECT doesn't return until the data hits the disk (because there is no OS cache), so if you want to write a page so you can reuse the buffer, DIRECT would be quite slow.

-- 
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Sun, 2003-02-02 at 05:39, Bruce Momjian wrote:

> We need free-behind for large sequential scans, like Solaris has. Do we
> have LRU-2 or LRU-K now?

No.

> As I remember, DIRECT doesn't return until the data hits the disk
> (because there is no OS cache), so if you want to write a page so you
> can reuse the buffer, DIRECT would be quite slow.

Why? If there is a finite amount of memory for doing buffering, the data needs to be written to disk at *some* point anyway. And if we didn't use the OS cache, the size of the PostgreSQL shared buffer would be much larger (I'd think 80% or more of the physical RAM in a typical high-end machine, for dedicated PostgreSQL usage).

One possible problem is that writing out dirty pages would become part of some key code paths in PostgreSQL (rather than assuming that the OS can write out dirty pages in the background, as it chooses to). But there are lots of ways to work around this, notably by using a daemon to periodically write out some of the pages in the background.

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Mon, 2 Feb 2003, Neil Conway wrote:

> > As I remember, DIRECT doesn't return until the data hits the disk
> > (because there is no OS cache), so if you want to write a page so you
> > can reuse the buffer, DIRECT would be quite slow.
> ...
> One possible problem is that writing out dirty pages would become part
> of some key code paths in PostgreSQL (rather than assuming that the OS
> can write out dirty pages in the background, as it chooses to). But
> there are lots of ways to work around this, notably by using a daemon
> to periodically write out some of the pages in the background.

If you're doing blocking direct I/O, you really have to have, and use, what I guess I'd call "scatter-scatter I/O": you need to choose a large number of blocks scattered over various positions in all your open files, and be able to request a write for all of them at once. If you write one by one, with each write going to disk before your request returns, you're forcing the physical order of the writes with no knowledge of where the blocks physically reside on the disk, and you stand a snowball's chance in you-know-where of getting a write ordering that maximizes your disk throughput.

This is why systems that use direct I/O, for the most part, use a raw partition and their own "filesystem" as well; you need to know the physical layout of the blocks to create efficient write strategies.

(MS SQL Server on Windows NT is a notable exception to this. They do, however, make you pre-create the data file in advance, and they suggest doing it on an empty partition, which at the very least would get you long stretches of the file in contiguous order. They may also be using tricks to make sure the file gets created in contiguous order, or they may be able to get information from the OS about the physical block numbers corresponding to logical block numbers in the file.)

cjs
-- 
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt,

> (MS SQL Server on Windows NT is a notable exception to this. They do,
> however, make you pre-create the data file in advance, and they suggest
> doing it on an empty partition, which at the very least would get you
> long stretches of the file in contiguous order. They may also be using
> tricks to make sure the file gets created in contiguous order, or they
> may be able to get information from the OS about the physical block
> numbers corresponding to logical block numbers in the file.)

MSSQL is, in fact, doing some kind of direct block addressing. If you attempt to move a partition on a disk containing MSSQL databases, SQL Server will crash on restart and be unrecoverable ... even if the other files on that disk are fine. Nor can you back up MSSQL by using disk imaging, unless you can recover to an identical model of disk/array.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Neil Conway wrote:

> Why? If there is a finite amount of memory for doing buffering, the data
> needs to be written to disk at *some* point anyway.
> ...
> One possible problem is that writing out dirty pages would become part
> of some key code paths in PostgreSQL (rather than assuming that the OS
> can write out dirty pages in the background, as it chooses to). But
> there are lots of ways to work around this, notably by using a daemon
> to periodically write out some of the pages in the background.

Right, and this is what we _don't_ want to do. If we need a buffer, we need it now; we can't wait for some other process to write the buffer directly to disk, nor do we want to group the writes somehow. And the other suggestion, that we group the writes ourselves, raises the same issue --- we would be bypassing the kernel buffers, which know more than we do. I can see the advantage of preventing double buffering being _quickly_ overtaken by the extra overhead of direct I/O.

-- 
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073