Thread: WAL Bypass for indexes
I have followed the discussion from 3 months ago on WAL bypass and wanted to offer some more information.

I have long believed that the bottleneck in transaction-oriented systems is the writing of the indexes, complete with splits and merges. A single update to one field of a heavily-indexed table could cause dozens of index writes to cascade.

The point is that speeding up index writing should offer the most "bang for the buck" relative to performance. As a bonus, a corrupted index can ALWAYS be recovered, while a corrupted table cannot.

I did some informal testing using pgbench on v8.07. First, I ran pgbench normally with 75 users doing 100 transactions, full vacuuming between runs. My machine consistently gave me 92 tps.

As an experiment, I commented out of the btree index source all of the XLOG code I could find. I basically replaced the test for a temp table with "if (0)" and then recompiled.

Running pgbench again with 75 users and 100 transactions, I received a consistent rate of 132 tps, a 40% increase in throughput.

It seems to me that major performance gains can be had by allowing some indexes to be created with some "UNSAFE-FAIL" flag, which would eliminate WAL logging and fsync() for the index files.

Upon recovery, the index gets rebuilt. The only downside is potentially long rebuild times during recovery.

Thoughts?

Sincerely,
Marty
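For reference, the relative gain in the figures above can be checked in a couple of lines (only the 92 and 132 tps numbers come from the message; the pgbench invocation itself is not reproduced here):

```python
# Sanity check of the reported pgbench numbers: 92 tps with index WAL
# intact, 132 tps with the btree XLOG calls compiled out via "if (0)".
baseline_tps = 92.0   # stock v8.07 build
bypass_tps = 132.0    # btree XLOG code disabled

gain = (bypass_tps - baseline_tps) / baseline_tps
print(f"{gain:.1%}")  # 43.5% -- i.e. roughly the "40% increase" reported
```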
"Martin Scholes" <marty@iicolo.com> writes:
> I did some informal testing using pgbench on v8.07. First, I ran pgbench
> normally with 75 users doing 100 transactions, full vacuuming between runs.
> My machine consistently gave me 92 tps.

> As an experiment, I commented out of the btree index source all of the XLOG
> code I could find. I basically replaced the test for a temp table with
> "if (0)" and then recompiled.

It'd be more interesting if you'd done this testing on 8.1, or better CVS HEAD, as we took several steps to improve WAL performance in 8.1 (notably, abandoning the 64-bit CRC code).

Also, when you don't say what configuration you were testing, the test results don't mean a lot. The cost of WAL logging is *very* heavily influenced by things such as checkpoint frequency, whether you have a separate drive for WAL, etc.

> It seems to me that major performance gains can be had by allowing some
> indexes to be created with some "UNSAFE-FAIL" flag,

This might be worth doing, but I'd want to see a more convincing demonstration before putting effort into it ...

			regards, tom lane
On 4/2/06, Martin Scholes <marty@iicolo.com> wrote:
> I have long believed that the bottleneck in transaction-oriented systems is
> the writing of the indexes, complete with splits and merges. A single update
> to one field of a heavily-indexed table could cause dozens of index writes
> to cascade.

This is not the case with *most* OLTP systems. It is, however, an issue with systems that use an MVCC implementation employing an MVTO-like algorithm, where indexes must be updated because row locations are never the same.

> Running again pgbench with 75 users and 100 transactions, I received a
> consistent rate of 132 tps, a 40% increase in throughput.

Less I/O = increased throughput. Generally, when you use less CPU or less I/O, you're going to get better performance. The question is, at what cost?

> Upon recovery, the index gets rebuilt. The only downside is potentially long
> rebuild times during recovery.

Yes, this is unacceptable for large systems. While retaining the main MVCC implementation PostgreSQL currently has, we're working on a prototype to reduce WAL I/O and index updates in a large percentage of OLTP situations by employing update-in-place under *safe* conditions. However, I don't see how your recommendation would actually be feasible.

Similarly, from my point of view, a database should NEVER become corrupt. Likewise, I think your tests may have been a little biased... you may want to look through the xlog and index code to see how temp tables are handled (or not). In my opinion, we should not have an option that allows the indexes to become corrupt.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> we're working on a prototype to reduce WAL I/O and index updates in a
> large percentage of OLTP situations by employing an update-in-place
> under *safe* conditions.

AFAICS there are no circumstances, ever, in which update-in-place is "safe". (No transaction can guarantee that it will commit.)

> In my opinion, I don't think we should have an option to allow the
> indexes to become corrupt.

Martin's proposal at least looks sensible; he just hasn't quite made the case that it's worth doing. If you're running a system that hardly ever crashes, you might be willing to accept index rebuilds during crash recovery, especially for indexes on relatively small, but frequently updated, tables (which should have reasonably short rebuild times).

Obviously this would have to be configurable per-index, or at least per-table, and I agree that it likely would never be the default. But it could be a good tradeoff for some cases.

			regards, tom lane
> AFAICS there are no circumstances, ever, in which update-in-place is
> "safe". (No transaction can guarantee that it will commit.)

In our case, it is totally safe. I'd certainly like to discuss it with you sometime at the anniversary.

> Martin's proposal at least looks sensible; he just hasn't quite made the
> case that it's worth doing ... I agree that it likely would never be the
> default. But it could be a good tradeoff for some cases.

I guess I can think of a few instances, but none that I would've chosen to use it in. IIRC, it's also more likely to increase the cost of checkpointing and/or require a good amount of bgwriter tuning.

As long as it's optional, I guess it's OK to let the administrator deal with recovery. Of course, in addition to no-fsync, we'll have another *possibly* dangerous option. BTW, I've seen no-fsync used far too many times because people think their hardware is invincible. My only suggestion is to make sure it's a very well documented option.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> I guess I can think of a few instances, but none that I would've
> chosen to use it in. IIRC, it's also more likely to increase the cost
> of checkpointing and/or require a good amount of bgwriter tuning.

How so? AFAICS it'd just eliminate WAL output.

> As long as it's optional, I guess it's OK to let the administrator
> deal with recovery.

As I understood it, the proposal was for a feature that would arrange for the required index rebuild to happen *automatically* during crash recovery. I agree it'd be unacceptable if it requires manual intervention at restart.

It occurs to me that if we had such a behavior, we could use it to "fix" hash indexes to be crash-safe, with less effort than WAL-ifying the hash code: just put in a small kluge to mark all hash indexes as needing rebuild during recovery. Not that I'm against teaching hash to do WAL, but no one's stepped up to the plate on that yet.

			regards, tom lane
> Martin's proposal at least looks sensible; he just hasn't quite made the
> case that it's worth doing. If you're running a system that hardly ever
> crashes, you might be willing to accept index rebuilds during crash
> recovery, especially for indexes on relatively small, but frequently
> updated, tables (which should have reasonably short rebuild times).
> Obviously this would have to be configurable per-index, or at least
> per-table, and I agree that it likely would never be the default.
> But it could be a good tradeoff for some cases.

My web system hasn't crashed in years, and last time I upgraded, the index rebuild time was maybe 30 mins. So I think a typical web application doesn't _really_ have that much data, and would greatly benefit from cranking up the TPS.

Chris
Ok Tom, I stand corrected.

I downloaded the latest snapshot and both scenarios (normal and WAL bypass for indexes) produced between 185 and 230 tps on my machine.

The lesson here is that whatever WAL magic has been performed on the latest release gives over 100% speedup, and the speedup is so good that skipping WAL for indexes does basically nothing.

Kudos.

Cheers,
M

_____ Original message _____
Subject: Re: [HACKERS] WAL Bypass for indexes
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 02nd April 2006 5:17:50 PM
On 4/2/06, Martin Scholes <marty@iicolo.com> wrote:
> The lesson here is that whatever WAL magic has been performed on the latest
> release gives over 100% speedup, and the speedup is so good that skipping
> WAL for indexes does basically nothing.

No need for me to reply to earlier messages :)

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
"Martin Scholes" <marty@iicolo.com> writes:
> Ok Tom, I stand corrected.

> I downloaded the latest snapshot and both scenarios (normal and WAL bypass
> for indexes) produced between 185 and 230 tps on my machine.

> The lesson here is that whatever WAL magic has been performed on the latest
> release gives over 100% speedup, and the speedup is so good that skipping
> WAL for indexes does basically nothing.

[ scratches head ... ] Actually, I'd have expected that you could still measure a difference. I thought it might be reduced to the point where we arguably shouldn't spend major effort on eliminating it. But no difference at all really does not compute. Could you recheck your test conditions? You still haven't been very clear what they are.

			regards, tom lane
On Sun, 2006-04-02 at 20:39 -0700, Martin Scholes wrote:
> The lesson here is that whatever WAL magic has been performed on the
> latest release gives over 100% speedup

That is good news.

> and the speedup is so good that skipping WAL for indexes does
> basically nothing.

I don't agree with this conclusion. Your original idea has possibilities, and these are not proved pointless by that test result. ISTM that any reduction in WAL will give a performance increase on a correctly configured system, and if it doesn't, there's something else wrong also.

The idea of WAL bypass for indexes is a valid one and would seem likely to have good benefit with small tables that are very frequently updated or inserted into, or for databases in a replication group where longer recovery time doesn't influence overall availability. If we have this as a per-index option, that would allow some indexes to be more important than others if multiple workloads were supported on the same set of tables. Plus it's easier to add a CREATE INDEX option, though it would never apply to catalog indexes.

We would automatically rebuild all marked indexes at the end of recovery - hence we'd need catalog indexes to be functioning - other parts of the system would not yet be available. Perhaps this could be optimized with a hash table that is inserted into during recovery, to remember all such indexes that have been modified. I'd be interested in implementing this unless someone beats me to it.

Thinking about this some more, I ask myself: why is it we log index inserts at all? We log heap inserts, which contain all the information we need to replay all index inserts also, so why bother? We would clearly need to still log full-page writes of any changed index pages, but we wouldn't need to log each individual change. My only answer is that WAL records are required because index insertion is not completely deterministic, because of the use of random().

Perhaps we could make the process completely deterministic but pseudo-random by using some aspect of the data/structure to determine the "random" insertion point? (I would still want to have full logging for catalog indexes.) [I'll leave UPDATEs out of the discussion for now]

Best Regards, Simon Riggs
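The rebuild-at-end-of-recovery bookkeeping described above can be sketched as a toy model (plain Python, not PostgreSQL source; all index names are hypothetical): during WAL replay, any WAL-bypassed index belonging to a replayed heap change is remembered, and the accumulated set is what gets rebuilt before the system opens up.

```python
# Toy model (not PostgreSQL code) of the recovery-time bookkeeping:
# remember every WAL-bypassed index whose table was touched during
# replay, then rebuild exactly those at end of recovery.
wal_bypassed = {"accounts_pkey_nolog", "history_idx_nolog"}  # hypothetical

needs_rebuild = set()

def replay_heap_change(indexes_on_table):
    """Called once per replayed heap insert/update WAL record."""
    for idx in indexes_on_table:
        if idx in wal_bypassed:        # only bypassed indexes go stale
            needs_rebuild.add(idx)

def end_of_recovery():
    """Indexes to rebuild before normal operation resumes."""
    return sorted(needs_rebuild)

replay_heap_change(["accounts_pkey_nolog"])
replay_heap_change(["branches_pkey"])   # fully logged, replayed via WAL
print(end_of_recovery())                # ['accounts_pkey_nolog']
```

The point of the set is Simon's hash-table optimization: an index is rebuilt once no matter how many replayed records touched it.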
Tom Lane wrote:

> [ scratches head ... ] Actually, I'd have expected that you could still
> measure a difference. I thought it might be reduced to the point where
> we arguably shouldn't spend major effort on eliminating it. But no
> difference at all really does not compute.

I agree completely and was baffled by the observations. I will do some more testing tonight to see if I can pin things down.

My machine is a cheapo Wal-Mart Compaq Presario, Sempron 3000+ (2 GHz, socket A, 512 KB cache), 768 MB, with the stock 40 GB IDE drive mirrored to a 160 GB SATA drive.

I run Mandrake 2006 with the latest patches and ran the Pg snapshot with all of the defaults.

I should have seen a difference in speed, but did not. One possible explanation is that both tests had the TPS flopping around between 185 and 225. It is possible that the improvement was so small compared to the variance that it was hard to see. I will run multiple tests and post the actual numbers.

Cheers,
M

_____ Original message _____
Subject: Re: [HACKERS] WAL Bypass for indexes
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 02nd April 2006 10:48:19 PM
Simon Riggs <simon@2ndquadrant.com> writes:
> Thinking about this some more, I ask myself: why is it we log index
> inserts at all? We log heap inserts, which contain all the information
> we need to replay all index inserts also, so why bother?

(1) We can't run user-defined functions during log replay. Quite aside from any risk of nondeterminism, the normal transaction infrastructure isn't functioning in that environment.

(2) Some of the index code is itself deliberately nondeterministic. I'm thinking in particular of the move-right-or-not choice in _bt_insertonpg() when there are many equal keys, but randomization is in general a useful algorithmic technique that we'd have to forswear.

(3) In the presence of concurrency, the sequence of heap-insert WAL records isn't enough info, because it doesn't tell you what order the index inserts occurred in. The btree code, at least, is sufficiently concurrent that even knowing the sequence of leaf-key insertions isn't full information --- it's not hard to imagine cases where decisions about where to split upper-level pages are dependent on which process manages to obtain lock on a page first.

There are probably some other reasons that I forgot. Check the archives; this point has been debated before. Basically the problem here is that you can't mix logged and non-logged operations --- if you're going to WAL-log any operations on an index then you have to be sure that the replay will regenerate exactly the same series of index states that happened the first time.

So none of this is an argument against "rebuild the index at end of replay"; but I don't see any workable half measures.

			regards, tom lane
On Mon, 2006-04-03 at 09:55 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Thinking about this some more, I ask myself: why is it we log index
> > inserts at all? We log heap inserts, which contain all the information
> > we need to replay all index inserts also, so why bother?

> I don't see any workable half measures.

Yep, looks that way.

> (1) We can't run user-defined functions during log replay. Quite
> aside from any risk of nondeterminism, the normal transaction
> infrastructure isn't functioning in that environment.

Didn't think of that one, but we could special-case it.

> (2) Some of the index code is itself deliberately nondeterministic.
> I'm thinking in particular of the move-right-or-not choice in
> _bt_insertonpg() when there are many equal keys, but randomization is
> in general a useful algorithmic technique that we'd have to forswear.

Understood.

> (3) In the presence of concurrency, the sequence of heap-insert WAL
> records isn't enough info, because it doesn't tell you what order the
> index inserts occurred in. The btree code, at least, is sufficiently
> concurrent that even knowing the sequence of leaf-key insertions isn't
> full information --- it's not hard to imagine cases where decisions
> about where to split upper-level pages are dependent on which process
> manages to obtain lock on a page first.

In the presence of concurrency, it could be the OS that decides who gets first crack at a page because of scheduling. We can never assume anything there, so definitely a killer argument.

But that does open up an opportunity if we locked the table AccessExclusive for a COPY LOCK into an existing table with indexes. One day maybe, but not interesting enough yet.

Best Regards, Simon Riggs
Ühel kenal päeval, E, 2006-04-03 kell 01:48, kirjutas Tom Lane:
> [ scratches head ... ] Actually, I'd have expected that you could still
> measure a difference. I thought it might be reduced to the point where
> we arguably shouldn't spend major effort on eliminating it. But no
> difference at all really does not compute. Could you recheck your test
> conditions? You still haven't been very clear what they are.

Actually I can imagine a condition where there is no difference: when the additional WAL records for indexes are written in one swoop, and you write one batch of WAL records per disk rotation, there is very little difference between writing a 1k WAL record and a 32k WAL record.

In other words, if you are writing the same number of (just smaller) records per commit, you get *exactly* the same performance, as OLTP is constrained mainly by iops, not throughput.

------------------
Hannu
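Rough numbers make Hannu's iops argument concrete. Assuming a 7,200 RPM drive flushed once per commit and ~50 MB/s sequential transfer (neither figure is stated in the thread; both are plausible for the era's commodity disks), the flush rate, not the bytes per record, dominates:

```python
# Illustration of the iops-bound commit rate: with one forced WAL flush
# per commit, rotational latency swamps the cost of writing more bytes.
RPM = 7200                       # assumed commodity IDE/SATA drive
rotations_per_sec = RPM / 60.0   # ~120 flush opportunities per second

def flush_bound_tps(wal_bytes_per_commit, transfer_mb_s=50.0):
    """Time per commit = one rotational latency + sequential transfer."""
    transfer_s = wal_bytes_per_commit / (transfer_mb_s * 1024 * 1024)
    return 1.0 / (1.0 / rotations_per_sec + transfer_s)

small = flush_bound_tps(1024)    # ~1 KB WAL record per commit
large = flush_bound_tps(32768)   # ~32 KB, index WAL included
print(round(small, 1), round(large, 1))   # 119.7 111.6
```

Both figures land near the ~120/s rotation rate, so dropping index WAL barely moves the bound; it also matches the order of magnitude of Martin's original 92 tps on an IDE drive.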
Ühel kenal päeval, E, 2006-04-03 kell 09:55, kirjutas Tom Lane:
> (2) Some of the index code is itself deliberately nondeterministic.
> I'm thinking in particular of the move-right-or-not choice in
> _bt_insertonpg() when there are many equal keys, but randomization is
> in general a useful algorithmic technique that we'd have to forswear.

Why can't we just order "many equal keys" by ctid? This would align the run of equal keys with table order, likely making index scans run faster, especially on very long equal runs.

------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
> Ühel kenal päeval, E, 2006-04-03 kell 09:55, kirjutas Tom Lane:
>> (2) Some of the index code is itself deliberately nondeterministic.
>> I'm thinking in particular of the move-right-or-not choice in
>> _bt_insertonpg() when there are many equal keys, but randomization is
>> in general a useful algorithmic technique that we'd have to forswear.

> Why can't we just order "many equal keys" by ctid ?

Why is that better? The bit of code mentioned above certainly isn't going to benefit --- it will lose the opportunity to try to avoid a page split.

			regards, tom lane
Jonah H. Harris wrote:
> As long as it's optional, I guess it's OK to let the administrator
> deal with recovery. Of course, in addition to no-fsync, we'll have
> another *possibly* dangerous option. BTW, I've seen no-fsync used far
> too many times because people think their hardware is invincible.

Use cases differ. I have used postgres in a system where, without the no-fsync option, the project would have been forced to use a different storage system. (Berkeley DB was being considered for the alternative.) We cared much more about throughput than corruption, so long as we would *know* when corruption occurred. The cost of occasionally reprocessing data after a corruption was much lower than the cost of always fsyncing to avoid it.

As I recall, during that project, we wanted some way to make the indexes run faster, even at the expense of occasional corruption. If Martin's idea has any noticeable performance improvement, I'm sure it would be welcomed by some.

mark
I wrote:
> I will run multiple tests and post the actual numbers.

I did run more extensive tests and did not bother writing down the numbers, and here's why: the unmodified Pg ran pgbench with a whopping average of 6.3% time in I/O wait.

If I was able to totally eliminate that time (which is impossible), then the best we could hope for is a 7% increase in performance by skipping WAL for indexes.

On a related note, we currently have some indexes that are unsafe during recovery (GIST and Hash come to mind).

In the spirit of making Pg "safe at any speed," would it make sense to put some code in the recovery section that rebuilds all indexes whose integrity cannot be assured?

M
"Martin Scholes" <marty@iicolo.com> writes:
> On a related note, we currently have some indexes that are unsafe during
> recovery (GIST and Hash come to mind).

> In the spirit of making Pg "safe at any speed," would it make sense to put
> some code in the recovery section that rebuilds all indexes whose integrity
> cannot be assured?

This is no longer true for GIST, and hash really ought to be fixed to support WAL. The combination of a performance advantage and being able to postpone fixing hash might be enough of a reason to add a post-recovery index rebuild feature, but I don't think the latter alone is --- fixing hash would be less work as well as being the "right" answer.

			regards, tom lane
On Wed, 2006-04-05 at 09:40 -0700, Martin Scholes wrote:
> > I will run multiple tests and post the actual numbers.
>
> I did run more extensive tests and did not bother writing down the
> numbers, and here's why: the unmodified Pg ran pgbench with a whopping
> average of 6.3% time in IO wait.
>
> If I was able to totally eliminate that time (which is impossible),
> then the best we could hope for is a 7% increase in performance by
> skipping WAL of indexes.

The WAL becomes more of a hotspot as you scale up numbers of CPUs. I guess this idea doesn't make much difference for smaller systems.

I think your idea still has merit, Martin. I'll do some tests also.

Best Regards, Simon Riggs
Simon,

> The WAL becomes more of a hotspot as you scale up numbers of CPUs.

I tend to agree, and the original idea came when I was working on a Sun quad-CPU system for a highly parallelized web application. Each page was broken into several dynamic images, each created "on the fly" from dozens of vector objects, and all of the data, as well as state information, was held in Pg.

As a complex app, each page load would spawn a dozen or so processes, each hitting the DB pretty hard, where all of the business logic resided.

It didn't take too many concurrent users to bring the server to its knees.

Here's the point: some inspection showed that a lot of time was being spent on index output. At that point I realized that there had to be a better way.

My simple home system is not capable of recreating those conditions. If you have an SMP box, please run some tests.

M

_____ Original message _____
Subject: Re: [HACKERS] WAL Bypass for indexes
Author: Simon Riggs <simon@2ndquadrant.com>
Date: 05th April 2006 11:0:34 AM
2006/4/3, Tom Lane <tgl@sss.pgh.pa.us>:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
> > we're working on a prototype to reduce WAL I/O and index updates in a
> > large percentage of OLTP situations by employing an update-in-place
> > under *safe* conditions.
>
> AFAICS there are no circumstances, ever, in which update-in-place is
> "safe". (No transaction can guarantee that it will commit.)

Updates to row values that did not "escape" the current transaction yet (i.e., rows that were created by, or have their last value written by, the current transaction).

Nicolas

--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html
"Nicolas Barbier" <nicolas.barbier@gmail.com> writes:
> 2006/4/3, Tom Lane <tgl@sss.pgh.pa.us>:
>> AFAICS there are no circumstances, ever, in which update-in-place is
>> "safe". (No transaction can guarantee that it will commit.)

> Updates to row values that did not "escape" the current transaction
> yet (i.e., rows that were created by or have their last value written by
> the current transaction).

Wrong, because MVCC also applies within transactions: a snapshot having a lower command counter than what you are executing at should not see your change. A typical example of this is that an update scan must not see tuples generated by triggers fired by that scan.

			regards, tom lane
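Tom's within-transaction point can be illustrated with a toy visibility model (plain Python, not the real heap-tuple visibility rules): each tuple records the command id (cid) that created it, and a command's snapshot sees only tuples from strictly earlier commands in the same transaction.

```python
# Toy within-transaction MVCC: a tuple created at command id (cid) N is
# invisible to the snapshot of command N itself, so an update scan
# cannot see rows that its own triggers insert mid-scan.
tuples = []   # list of (value, cid_created)

def insert(value, cid):
    tuples.append((value, cid))

def scan(snapshot_cid):
    """Values visible to a command running with this snapshot cid."""
    return [v for v, cid in tuples if cid < snapshot_cid]

insert("original row", 1)        # inserted by command 1
insert("row from trigger", 2)    # fired during the command-2 update scan
print(scan(2))                   # ['original row'] -- trigger row hidden
```

This is why updating a tuple in place would break even single-transaction semantics: the earlier-cid snapshot would see the new value.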