Thread: WAL Bypass for indexes

WAL Bypass for indexes

From
"Martin Scholes"
Date:
<div align="LEFT">I have followed the discussion from 3 months ago on WAL bypass and wanted to offer some more
information.</div><divalign="LEFT"> </div><div align="LEFT">I have long believed that the bottleneck in
transaction-orientedsystems is the writing of the indexes, complete with splits and merges. A single update to one
fieldof a heavily-indexed table could cause dozens of index writes to cascade.</div><div align="LEFT"> </div><div
align="LEFT">Thepoint is that speeding up index writing should offer the most "bang for the buck" relative to
performance.As a bonus, a corrupted index can ALWAYS be recovered, while a corrupted table cannot.</div><div
align="LEFT"> </div><divalign="LEFT">I did some informal testing using pgbench on v8.07. First, I ran pgbench normally
with75 users doing 100 transactions, full vacuuming between runs. My machine consistently gave me 92 tps.</div><div
align="LEFT"> </div><divalign="LEFT">As an experiment, I commented out of the btree index source all of the XLOG code I
couldfind. I basically replaced the test for a temp table with "if (0)" and then recompiled.</div><div
align="LEFT"> </div><divalign="LEFT">Running again pgbench with 75 users and 100 transactions, I received a consistent
rateof 132 tps, a 40% increase in throughput.</div><div align="LEFT"> </div><div align="LEFT">It seems to me that major
performancegains can be had by allowing some indexes to be created with some "UNSAFE-FAIL" flag, which would eliminate
WALlogging and fsync() for the index files.</div><div align="LEFT"> </div><div align="LEFT">Upon recovery, the index
getsrebuilt. The only downside is potentially long rebuild times during recovery.</div><div align="LEFT"> </div><div
align="LEFT">Thoughts?</div><divalign="LEFT"> </div><div align="LEFT">Sincerely,</div><div align="LEFT">Marty</div> 

Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Martin Scholes" <marty@iicolo.com> writes:
> I did some informal testing using pgbench on v8.07.  First, I ran pgbench =
> normally with 75 users doing 100 transactions, full vacuuming between runs. =
>  My machine consistently gave me 92 tps.

> As an experiment, I commented out of the btree index source all of the XLOG =
> code I could find.  I basically replaced the test for a temp table with "if =
> (0)" and then recompiled.

It'd be more interesting if you'd done this testing on 8.1, or better
CVS HEAD, as we took several steps to improve WAL performance in 8.1
(notably, abandoning the 64-bit CRC code).  Also, when you don't say
what configuration you were testing, the test results don't mean a lot.
The cost of WAL logging is *very* heavily influenced by things such as
checkpoint frequency, whether you have a separate drive for WAL, etc.

> It seems to me that major performance gains can be had by allowing some =
> indexes to be created with some "UNSAFE-FAIL" flag,

This might be worth doing, but I'd want to see a more convincing
demonstration before putting effort into it ...
        regards, tom lane


Re: WAL Bypass for indexes

From
"Jonah H. Harris"
Date:
On 4/2/06, Martin Scholes <marty@iicolo.com> wrote:
> I have long believed that the bottleneck in transaction-oriented systems is
> the writing of the indexes, complete with splits and merges. A single update
> to one field of a heavily-indexed table could cause dozens of index writes
> to cascade.

This is not the case with *most* OLTP systems.  It is, however, an
issue with systems that use a MVCC system which employ an MVTO-like
algorithm where indexes must be updated because row locations are
never the same.

> Running again pgbench with 75 users and 100 transactions, I received a
> consistent rate of 132 tps, a 40% increase in throughput.

Less I/O = increased throughput.  Generally, when you use less CPU or
less I/O, you're going to get better performance.  The question is, at
what cost?

> Upon recovery, the index gets rebuilt. The only downside is potentially long
> rebuild times during recovery.

Yes, this is unacceptable for large systems.

While retaining the main MVCC implementation PostgreSQL currently has,
we're working on a prototype to reduce WAL I/O and index updates in a
large percentage of OLTP situations by employing an update-in-place
under *safe* conditions.  However, I don't see how your recommendation
would actually be feasable.  Similarly, from my point of view, a
database should NEVER become corrupt.

Likewise, I think your tests may have been a little biased... you may
want to look through the xlog and index code to see how temp tables
are handled (or not).

In my opinion, I don't think we should have an option to allow the
indexes to become corrupt.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> we're working on a prototype to reduce WAL I/O and index updates in a
> large percentage of OLTP situations by employing an update-in-place
> under *safe* conditions.

AFAICS there are no circumstances, ever, in which update-in-place is
"safe".  (No transaction can guarantee that it will commit.)

> In my opinion, I don't think we should have an option to allow the
> indexes to become corrupt.

Martin's proposal at least looks sensible; he just hasn't quite made the
case that it's worth doing.  If you're running a system that hardly ever
crashes, you might be willing to accept index rebuilds during crash
recovery, especially for indexes on relatively small, but frequently
updated, tables (which should have reasonably short rebuild times).
Obviously this would have to be configurable per-index, or at least
per-table, and I agree that it likely would never be the default.
But it could be a good tradeoff for some cases.
        regards, tom lane


Re: WAL Bypass for indexes

From
"Jonah H. Harris"
Date:
> AFAICS there are no circumstances, ever, in which update-in-place is
> "safe".  (No transaction can guarantee that it will commit.)

In our case, it is totally safe.  I'd certainly like to discuss it
with you sometime at the anniversary.

> Martin's proposal at least looks sensible; he just hasn't quite made the
> case that it's worth doing ... I agree that it likely would never be the
> default. But it could be a good tradeoff for some cases.

I guess I can think of a few instances, but none that I would've
chosen to use it in.  IIRC, it's also more likely to increase the cost
of checkpointing and/or require a good amount of bgwriter tuning.

As long as it's optional, I guess it's OK to let the administrator
deal with recovery.  Of course, in addition to no-fsync, we'll have
another *possibly* dangerous option.  BTW, I've seen no-fsync used far
too many times because people think they're hardware is invincible.

My only suggestion is to make sure it's a very well documented option.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> I guess I can think of a few instances, but none that I would've
> chosen to use it in.  IIRC, it's also more likely to increase the cost
> of checkpointing and/or require a good amount of bgwriter tuning.

How so?  AFAICS it'd just eliminate WAL output.

> As long as it's optional, I guess it's OK to let the administrator
> deal with recovery.

As I understood it, the proposal was for a feature that would arrange
for the required index rebuild to happen *automatically* during crash
recovery.  I agree it'd be unacceptable if it requires manual
intervention at restart.

It occurs to me that if we had such a behavior, we could use it to "fix"
hash indexes to be crash-safe, with less effort than WAL-ifying the hash
code: just put in a small kluge to mark all hash indexes as needing
rebuild during recovery.  Not that I'm against teaching hash to do WAL,
but no one's stepped up to the plate on that yet.
        regards, tom lane


Re: WAL Bypass for indexes

From
Christopher Kings-Lynne
Date:
> Martin's proposal at least looks sensible; he just hasn't quite made the
> case that it's worth doing.  If you're running a system that hardly ever
> crashes, you might be willing to accept index rebuilds during crash
> recovery, especially for indexes on relatively small, but frequently
> updated, tables (which should have reasonably short rebuild times).
> Obviously this would have to be configurable per-index, or at least
> per-table, and I agree that it likely would never be the default.
> But it could be a good tradeoff for some cases.


My web system hasn't crashed in years, and last time I upgraded the 
index rebuild time was maybe 30 mins?  So, I think a typical web 
application doesn't _really_ have that much data, and would greatly 
benefit from cranking the TPS.

Chris



Re: WAL Bypass for indexes

From
"Martin Scholes"
Date:
<div align="LEFT">Ok Tom, I stand corrected.</div><div align="LEFT"> </div><div align="LEFT">I downloaded the latest
snapshotand both scenarios (normal and WAL bypass for indexes) produced between 185 and 230 tps on my
machine.</div><divalign="LEFT"> </div><div align="LEFT">The lesson here is that whatever WAL magic has been performed
onthe latest release gives over 100% speedup, and the speedup is so good that skipping WAL for indexes does basically
nothing.</div><divalign="LEFT"> </div><div align="LEFT">Kudos.</div><div align="LEFT"> </div><div
align="LEFT">Cheers,</div><divalign="LEFT">M</div><div align="LEFT">_____ Original message _____</div><div
align="LEFT">Subject:Re: [HACKERS] WAL Bypass for indexes </div><div align="LEFT">Author: Tom Lane
<tgl@sss.pgh.pa.us></div><divalign="LEFT">Date: 02nd April 2006 5:17:50 PM</div><div align="LEFT"> </div><div
align="LEFT">"MartinScholes" <marty@iicolo.com> writes:<br />> I did some informal testing using pgbench on
v8.07.First, I ran pgbench =<br />> normally with 75 users doing 100 transactions, full vacuuming between runs. =<br
/>>My machine consistently gave me 92 tps.<br /></div><div align="LEFT">> As an experiment, I commented out of
thebtree index source all of the XLOG =<br />> code I could find. I basically replaced the test for a temp table
with"if =<br />> (0)" and then recompiled.<br /></div><div align="LEFT">It'd be more interesting if you'd done this
testingon 8.1, or better<br />CVS HEAD, as we took several steps to improve WAL performance in 8.1<br />(notably,
abandoningthe 64-bit CRC code). Also, when you don't say</div><div align="LEFT">what configuration you were testing,
thetest results don't mean a lot.<br />The cost of WAL logging is *very* heavily influenced by things such as<br
/>checkpointfrequency, whether you have a separate drive for WAL, etc.<br /></div><div align="LEFT">> It seems to me
thatmajor performance gains can be had by allowing some =<br />> indexes to be created with some "UNSAFE-FAIL"
flag,<br/></div><div align="LEFT">This might be worth doing, but I'd want to see a more convincing<br />demonstration
beforeputting effort into it ...<br /></div><div align="LEFT"> regards, tom lane<br /></div> 

Re: WAL Bypass for indexes

From
"Jonah H. Harris"
Date:
On 4/2/06, Martin Scholes <marty@iicolo.com> wrote:
> The lesson here is that whatever WAL magic has been performed on the latest
> release gives over 100% speedup, and the speedup is so good that skipping
> WAL for indexes does basically nothing.

No need for me to reply to earlier messages :)

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Martin Scholes" <marty@iicolo.com> writes:
> Ok Tom, I stand corrected.

> I downloaded the latest snapshot and both scenarios (normal and WAL bypass =
> for indexes) produced between 185 and 230 tps on my machine.

> The lesson here is that whatever WAL magic has been performed on the latest =
> release gives over 100% speedup, and the speedup is so good that skipping =
> WAL for indexes does basically nothing.

[ scratches head ... ]  Actually, I'd have expected that you could still
measure a difference.  I thought it might be reduced to the point where
we arguably shouldn't spend major effort on eliminating it.  But no
difference at all really does not compute.  Could you recheck your test
conditions?  You still haven't been very clear what they are.
        regards, tom lane


Re: WAL Bypass for indexes

From
Simon Riggs
Date:
On Sun, 2006-04-02 at 20:39 -0700, Martin Scholes wrote:
> The lesson here is that whatever WAL magic has been performed on the
> latest release gives over 100% speedup

That is good news.

> and the speedup is so good that skipping WAL for indexes does
> basically nothing.

I don't agree with this conclusion. Your original idea has possibilities
and these are not proved pointless by that test result. ISTM that any
reduction in WAL will give a performance increase on a correctly
configured system, and if it doesn't there's something else wrong also.

The idea of WALBypass for indexes is a valid one and would seem likely
to have good benefit with small tables that are very frequently updated
or inserted into, or for databases in a replication group where longer
recovery time doesn't influence overall availability.

If we have this as a per-index option, that would allow some indexes to
be more important than others if multiple workloads were supported on
the same set of tables. Plus its easier to add a CREATE INDEX option,
but would never apply to catalog indexes. We would automatically rebuild
all marked indexes at the end of recovery - hence we'd need catalog
indexes to be functioning - other parts of the system would not yet be
available. Perhaps optimized with a unique hash table that is inserted
into during recovery to remember all indexes that have been modified.
I'd be interested in implementing this unless someone beats me to it.

Thinking about this some more, I ask myself: why is it we log index
inserts at all? We log heap inserts, which contain all the information
we need to replay all index inserts also, so why bother? We would
clearly need to still log full page writes of any changed index pages,
but we wouldn't need to log each individual change. My only answer is
that WAL records are required because index insertion is not completely
deterministic because of the use of random(). Perhaps we could make the
process completely deterministic but pseudo-random by using some aspect
of the data/structure to determine the "random" insertion point? (I
would still want to have full logging for catalog indexes).

[I'll leave UPDATEs out of the discussion for now]

Best Regards, Simon Riggs




Re: WAL Bypass for indexes

From
"Martin Scholes"
Date:
<div align="LEFT">Tom Lane wrote:</div><div align="LEFT"> </div><div align="LEFT">> [ scratches head ... ] Actually,
I'dhave expected that you could still<br />> measure a difference. I thought it might be reduced to the point
where<br/> > we arguably shouldn't spend major effort on eliminating it. But no</div><div align="LEFT">>
differenceat all really does not compute.</div><div align="LEFT"> </div><div align="LEFT">I agree completely and was
baffledby the observations. I will do some more testing tonight to see if I can pin things down.</div><div
align="LEFT"> </div><divalign="LEFT">My machine is a cheapo Wal-Mart Compaq Presario, Sempron 3000+ (2Ghz, socket A,
512KBcache), 768 MB, with the stock 40 GB IDE drive mirrored to a 160 GB SATA drive.</div><div align="LEFT"> </div><div
align="LEFT">Irun Mandrake 2006 with the latest patches and ran the Pg snapshot with all of the defaults.</div><div
align="LEFT"> </div><divalign="LEFT">I should have seen a difference in speed, but did not. One possible explanation is
thatboth tests had the TPS flopping around between 185 and 225. It is possible that the improvement was so small
comparedto the variance that it was hard to see. I will run multiple tests and post the actual numbers.</div><div
align="LEFT"> </div><divalign="LEFT">Cheers,</div><div align="LEFT">M</div><div align="LEFT"> </div><div
align="LEFT">_____Original message _____</div><div align="LEFT">Subject: Re: [HACKERS] WAL Bypass for indexes
</div><divalign="LEFT">Author: Tom Lane <tgl@sss.pgh.pa.us></div><div align="LEFT">Date: 02nd April 2006 10:48:19
PM</div><divalign="LEFT"> </div><div align="LEFT">"Martin Scholes" <marty@iicolo.com> writes:<br />> Ok Tom, I
standcorrected.<br /></div><div align="LEFT">> I downloaded the latest snapshot and both scenarios (normal and WAL
bypass=<br />> for indexes) produced between 185 and 230 tps on my machine.<br /></div><div align="LEFT">> The
lessonhere is that whatever WAL magic has been performed on the latest =<br />> release gives over 100% speedup, and
thespeedup is so good that skipping =<br />> WAL for indexes does basically nothing.<br /></div><div align="LEFT">[
scratcheshead ... ] Actually, I'd have expected that you could still<br />measure a difference. I thought it might be
reducedto the point where<br />we arguably shouldn't spend major effort on eliminating it. But no</div><div
align="LEFT">differenceat all really does not compute. Could you recheck your test<br />conditions? You still haven't
beenvery clear what they are.<br /></div><div align="LEFT"> regards, tom lane<br /></div><div
align="LEFT">---------------------------(endof broadcast)---------------------------<br />TIP 5: don't forget to
increaseyour free space map settings<br /></div> 

Re: WAL Bypass for indexes

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Thinking about this some more, I ask myself: why is it we log index
> inserts at all? We log heap inserts, which contain all the information
> we need to replay all index inserts also, so why bother?

(1) We can't run user-defined functions during log replay.  Quite
aside from any risk of nondeterminism, the normal transaction
infrastructure isn't functioning in that environment.

(2) Some of the index code is itself deliberately nondeterministic.
I'm thinking in particular of the move-right-or-not choice in
_bt_insertonpg() when there are many equal keys, but randomization is
in general a useful algorithmic technique that we'd have to forswear.

(3) In the presence of concurrency, the sequence of heap-insert WAL
records isn't enough info, because it doesn't tell you what order the
index inserts occurred in.  The btree code, at least, is sufficiently
concurrent that even knowing the sequence of leaf-key insertions isn't
full information --- it's not hard to imagine cases where decisions
about where to split upper-level pages are dependent on which process
manages to obtain lock on a page first.

There are probably some other reasons that I forgot.  Check the
archives; this point has been debated before.

Basically the problem here is that you can't mix logged and non-logged
operations --- if you're going to WAL-log any operations on an index
then you have to be sure that the replay will regenerate exactly the
same series of index states that happened the first time.  So none of
this is an argument against "rebuild the index at end of replay"; but
I don't see any workable half measures.
        regards, tom lane


Re: WAL Bypass for indexes

From
Simon Riggs
Date:
On Mon, 2006-04-03 at 09:55 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Thinking about this some more, I ask myself: why is it we log index
> > inserts at all? We log heap inserts, which contain all the information
> > we need to replay all index inserts also, so why bother?

> I don't see any workable half measures.

Yep, looks that way.

> (1) We can't run user-defined functions during log replay.  Quite
> aside from any risk of nondeterminism, the normal transaction
> infrastructure isn't functioning in that environment.

Didn't think of that one, but we could special case it.

> (2) Some of the index code is itself deliberately nondeterministic.
> I'm thinking in particular of the move-right-or-not choice in
> _bt_insertonpg() when there are many equal keys, but randomization is
> in general a useful algorithmic technique that we'd have to forswear.

Understood.

> (3) In the presence of concurrency, the sequence of heap-insert WAL
> records isn't enough info, because it doesn't tell you what order the
> index inserts occurred in.  The btree code, at least, is sufficiently
> concurrent that even knowing the sequence of leaf-key insertions isn't
> full information --- it's not hard to imagine cases where decisions
> about where to split upper-level pages are dependent on which process
> manages to obtain lock on a page first.

In the presence of concurrency, it could be the OS who decides who gets
first crack at a page because of scheduling. We can never assume
anything there, so definitely a killer argument.

But that does open up an opportunity if we locked the table
AccessExclusive for a COPY LOCK into an existing table with indexes. One
day maybe, but not interesting enough yet.

Best Regards, Simon Riggs



Re: WAL Bypass for indexes

From
Hannu Krosing
Date:
Ühel kenal päeval, E, 2006-04-03 kell 01:48, kirjutas Tom Lane:
> "Martin Scholes" <marty@iicolo.com> writes:
> > Ok Tom, I stand corrected.
> 
> > I downloaded the latest snapshot and both scenarios (normal and WAL bypass =
> > for indexes) produced between 185 and 230 tps on my machine.
> 
> > The lesson here is that whatever WAL magic has been performed on the latest =
> > release gives over 100% speedup, and the speedup is so good that skipping =
> > WAL for indexes does basically nothing.
> 
> [ scratches head ... ]  Actually, I'd have expected that you could still
> measure a difference.  I thought it might be reduced to the point where
> we arguably shouldn't spend major effort on eliminating it.  But no
> difference at all really does not compute.  Could you recheck your test
> conditions?  You still haven't been very clear what they are.

Actually I can imagine a condition where there is no difference:

When the additional WAL records for indexes are written in one swoop,
and you write 1 batch of WAL records per disk rotation, there is very
small difference between writing 1k WAL record and 32k wal record.

In other words, if you are writing the same number of smaller records
per commit, you get *exactly the same performance as OLTP is constrained
mainly by iops, not throughput.

------------------
Hannu




Re: WAL Bypass for indexes

From
Hannu Krosing
Date:
Ühel kenal päeval, E, 2006-04-03 kell 09:55, kirjutas Tom Lane:

> (2) Some of the index code is itself deliberately nondeterministic.
> I'm thinking in particular of the move-right-or-not choice in
> _bt_insertonpg() when there are many equal keys, but randomization is
> in general a useful algorithmic technique that we'd have to forswear.

Why can't we just order "many equal keys" by ctid ? This would align the
run of equal keys with table order, likely making index scans run
faster, especially on very long equal runs.

------------
Hannu




Re: WAL Bypass for indexes

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> Ühel kenal päeval, E, 2006-04-03 kell 09:55, kirjutas Tom Lane:
>> (2) Some of the index code is itself deliberately nondeterministic.
>> I'm thinking in particular of the move-right-or-not choice in
>> _bt_insertonpg() when there are many equal keys, but randomization is
>> in general a useful algorithmic technique that we'd have to forswear.

> Why can't we just order "many equal keys" by ctid ?

Why is that better?  The bit of code mentioned above certainly isn't
going to benefit --- it will lose the opportunity to try to avoid a page
split.
        regards, tom lane


Re: WAL Bypass for indexes

From
Mark Dilger
Date:
Jonah H. Harris wrote:

> As long as it's optional, I guess it's OK to let the administrator
> deal with recovery.  Of course, in addition to no-fsync, we'll have
> another *possibly* dangerous option.  BTW, I've seen no-fsync used far
> too many times because people think they're hardware is invincible.

Use cases differ.  I have used postgres in a system where without the no-fsync
option the project would have been forced to use a different storage system.
(Berkeley-DB was being considered for the alternative.)  We cared much more
about throughput than corruption, so long as we would *know* when corruption
occurred.  The cost of occasionally reprocessing data after a corruption was
much lower than the cost of always fsyncing to avoid it.

As I recall, during that project, we wanted some way to make the indexes run
faster, even at the expense of occasional corruption.  If Martin's idea has any
noticeable performance improvement, I'm sure it would be welcomed by some.

mark


Re: WAL Bypass for indexes

From
"Martin Scholes"
Date:
<div align="LEFT">I wrote:</div><div align="LEFT"> </div><div align="LEFT">> I will run multiple tests and post the
actualnumbers.<br /></div><div align="LEFT">I did run more extensive tests and did not bother writing down the numbers,
andhere's why: the unmodified Pg ran pgbench with a whopping average of 6.3% time in IO wait.</div><div
align="LEFT"> </div><divalign="LEFT">If I was able to totally eliminate that time (which is impossible), then the best
wecould hope for is a 7% increase in performance by skipping WAL of indexes.</div><div align="LEFT"> </div><div
align="LEFT">Ona related note, we currently have some indexes that are unsafe during recovery (GIST and Hash come to
mind).</div><divalign="LEFT"> </div><div align="LEFT">In the spirit of making Pg "safe at any speed," would it make
senseto put some code in the recovery section that rebuilds all indexes whose integrity cannot be assured?</div><div
align="LEFT"> </div><divalign="LEFT">M</div> 

Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Martin Scholes" <marty@iicolo.com> writes:
> On a related note, we currently have some indexes that are unsafe during
> recovery (GIST and Hash come to mind).

> In the spirit of making Pg "safe at any speed," would it make sense to put
> some code in the recovery section that rebuilds all indexes whose integrity
> cannot be assured?

This is no longer true for GIST, and hash really ought to be fixed to
support WAL.  The combination of a performance advantage and being able
to postpone fixing hash might be enough of a reason to add a
post-recovery index rebuild feature, but I don't think the latter
alone is --- fixing hash would be less work as well as being the "right"
answer.
        regards, tom lane


Re: WAL Bypass for indexes

From
Simon Riggs
Date:
On Wed, 2006-04-05 at 09:40 -0700, Martin Scholes wrote:
> > I will run multiple tests and post the actual numbers.
> 
> I did run more extensive tests and did not bother writing down the
> numbers, and here's why: the unmodified Pg ran pgbench with a whopping
> average of 6.3% time in IO wait.
>  
> If I was able to totally eliminate that time (which is impossible),
> then the best we could hope for is a 7% increase in performance by
> skipping WAL of indexes.

The WAL becomes more of a hotspot as you scale up numbers of CPUs. I
guess this idea doesn't make much difference for smaller systems.

I think your idea still has merit, Martin. I'll do some tests also.

Best Regards, Simon Riggs



Re: WAL Bypass for indexes

From
"Martin Scholes"
Date:
<div align="LEFT">Simon,</div><div align="LEFT"> </div><div align="LEFT">>The WAL becomes more of a hotspot as you
scaleup numbers of CPUs.</div><div align="LEFT"> </div><div align="LEFT">I tend to agree and the original idea came
whenI was working on a Sun quad-CPU system for a highly parallelized web application. Each page was broken into several
dynamicimages, each created "on the fly" from dozens of vector objects and all of the data, as well as state
informationwas held in Pg.</div><div align="LEFT"> </div><div align="LEFT">As a complex app, each page load would spawn
adozen or so processes, each hitting the DB pretty hard, where all of the business logic resided.</div><div
align="LEFT"> </div><divalign="LEFT">It didn't take too many concurrent users to bring the server to its
knees.</div><divalign="LEFT"> </div><div align="LEFT">Here's the point: some inspection showed that a lot of time was
beingspent on index output. At that point I realized that there had to be a better way.</div><div
align="LEFT"> </div><divalign="LEFT">My simple home system is not capable of recreating those conditions. If you have
anSMP box, please run some tests.</div><div align="LEFT"> </div><div align="LEFT">M</div><div align="LEFT"> </div><div
align="LEFT">_____Original message _____</div><div align="LEFT">Subject: Re: [HACKERS] WAL Bypass for indexes</div><div
align="LEFT">Author:Simon Riggs <simon@2ndquadrant.com></div><div align="LEFT">Date: 05th April 2006 11:0:34
AM</div><divalign="LEFT"> </div><div align="LEFT">On Wed, 2006-04-05 at 09:40 -0700, Martin Scholes wrote:<br
/></div><divalign="LEFT">> > I will run multiple tests and post the actual numbers.<br />> <br />> I did
runmore extensive tests and did not bother writing down the<br />> numbers, and here's why: the unmodified Pg ran
pgbenchwith a whopping<br />> average of 6.3% time in IO wait.<br />> <br />> If I was able to totally
eliminatethat time (which is impossible),<br />> then the best we could hope for is a 7% increase in performance
by<br/>> skipping WAL of indexes.</div><div align="LEFT"> </div><div align="LEFT">The WAL becomes more of a hotspot
asyou scale up numbers of CPUs. I<br />guess this idea doesn't make much difference for smaller systems.<br
/></div><divalign="LEFT">I think your idea still has merit, Martin. I'll do some tests also.<br /></div><div
align="LEFT">BestRegards, Simon Riggs<br /></div><div align="LEFT"> </div><div
align="LEFT">---------------------------(endof broadcast)---------------------------<br />TIP 2: Don't 'kill -9' the
postmaster<br/></div> 

Re: WAL Bypass for indexes

From
"Nicolas Barbier"
Date:
2006/4/3, Tom Lane <tgl@sss.pgh.pa.us>:

> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
> > we're working on a prototype to reduce WAL I/O and index updates in a
> > large percentage of OLTP situations by employing an update-in-place
> > under *safe* conditions.
>
> AFAICS there are no circumstances, ever, in which update-in-place is
> "safe".  (No transaction can guarantee that it will commit.)

Updates to row values that did not "escape" the currect transaction
yet (ie, rows that were created by or have their last value written by
the current transaction).

Nicolas

--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html


Re: WAL Bypass for indexes

From
Tom Lane
Date:
"Nicolas Barbier" <nicolas.barbier@gmail.com> writes:
> 2006/4/3, Tom Lane <tgl@sss.pgh.pa.us>:
>> AFAICS there are no circumstances, ever, in which update-in-place is
>> "safe".  (No transaction can guarantee that it will commit.)

> Updates to row values that did not "escape" the currect transaction
> yet (ie, rows that were created by or have their last value written by
> the current transaction).

Wrong, because MVCC also applies within transactions: a snapshot having
a lower command counter than what you are executing at should not see
your change.  A typical example of this is that an update scan must not
see tuples generated by triggers fired by that scan.
        regards, tom lane