Thread: Performance and WAL on big inserts/updates
I combed the archives but could not find a discussion on this, and am amazed it hasn't come up before. My experience with Oracle (and now limited experience with Pg) is that the major choke point in performance is not the CPU or read I/O, it is the log performance of big update and select statements.

Essentially, the data is written twice: first to the log and then the data files. This would be ok except the transaction is regularly frozen while the log files sync to disk with a bunch of tiny (8KB for Oracle and Pg) write requests. I realize that the logs must be there to ensure crash recovery, and that PITR is on the way to supplement this.

If a transaction will do large updates or inserts, why don't we just log the parsed statements in the WAL instead of the individual data blocks that have changed? Then, the data files could be fsync()ed at every checkpoint, but essentially no write I/O takes place in the interim. Outside of checkpoints, large updates would only need to fsync() a very small addition to the log files.

Recovery could be similar to how I understand it works today:

1. Roll back in-flight changes.
2. Roll forward log entries in order, either applying direct changes to the data or re-executing the parsed command stored in the log.

Some informal testing suggests that we get a factor of 8 improvement in speed here if we completely disable fsync() in large updates under Pg. I would suspect that a big portion of those gains would be preserved if fsync() calls were limited to checkpoints and the parsed SQL command were saved in the log.

Why have I not seen this in any database? There must be a reason.

Thanks in advance.

Sincerely,
Marty
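To make the scale of the proposal concrete, the kind of statement being discussed is a few dozen bytes of SQL that touch millions of indexed rows, so logging the statement text would be tiny compared with logging every changed block. The table and column names below are hypothetical, purely for illustration:

-- A short searched UPDATE that rewrites millions of rows (and every index
-- entry on them), generating far more block-level WAL than statement text.
UPDATE addresses
   SET carrier_route = 'R012'
 WHERE state = 'CA';

-- A bulk load expressed as a single small statement.
INSERT INTO addresses_archive
SELECT * FROM addresses
 WHERE last_verified < '2003-01-01';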
> If a transaction will do large updates or inserts, why don't we just log
> the parsed statements in the WAL instead of the individual data blocks

UPDATE table SET col = random();
>>>>> "Marty" == Marty Scholes <marty@outputservices.com> writes: Marty> Why have I not seen this in any database? Marty> There must be a reason. For ARIES-style systems, logging parsed statements (commonly called "logical" logging) is not preferred compared to logging data items ("physical" or "physiological" logging). A major reason for this is that logical logs make recovery contingent on being able to execute the "parsed statements". This execution might, however, not be possible if the system is itself not in a consistent state .. as is normally the case during recovery. What if, for instance, it's the catalog tables that were hosed when the system went down ? It may be difficult to execute the parsed statements without the catalogs. For this reason, a major goal of ARIES was to have each and every data object (tables/indexes) individually recoverable. So ARIES follows page-oriented redo logging. Having said that, page-oriented undo logging can be a pain when B-tree pages split. For higher concurrency, ARIES uses logical undo logging. In this case, the logs are akin to your "parsed statement" idea. In any case, the only place that parsed statements are useful, imo are with searched updates that cause a large number of records to change and with "insert into from select" statements. Then, there is also the case that this, the "parsed statements" approach, is not a general solution. How would you handle the "update current of cursor" scenarios ? In this case, there is some application logic that determines the precise records that change and how they change. Ergo, it is my claim that while logical redo logging does have some benefits, it is not a viable general solution. -- Pip-pip Sailesh http://www.cs.berkeley.edu/~sailesh
Marty Scholes <marty@outputservices.com> writes:
> My experience with Oracle (and now limited experience with Pg) is that
> the major choke point in performance is not the CPU or read I/O, it is
> the log performance of big update and select statements.

If your load is primarily big update statements, maybe so...

> Essentially, the data is written twice: first to the log and then the
> data files. This would be ok except the transaction is regularly frozen
> while the log files sync to disk with a bunch of tiny (8KB for Oracle
> and Pg) write requests.

I don't think I buy that claim. We don't normally fsync the log file except at transaction commit (and read-only transactions don't generate any commit record, so they don't cause an fsync). If a single transaction is generating lots of log data, it doesn't have to wait for that data to hit disk before it can do more stuff.

But having said that --- on some platforms our default WAL sync method is open_datasync, which could result in the sort of behavior you are talking about. Try experimenting with the other possible values of wal_sync_method to see if you like them better.

> If a transaction will do large updates or inserts, why don't we just log
> the parsed statements in the WAL instead of the individual data blocks
> that have changed?

As already pointed out, this would not give enough information to reproduce the database state.

> Some informal testing suggests that we get a factor of 8 improvement in
> speed here if we completely disable fsync() in large updates under Pg.

That probably gets you into a situation where no I/O is really happening at all, it's just being absorbed by kernel disk buffers. Unfortunately that doesn't have a lot to do with the performance you can get if you want to be sure you don't lose data ...

BTW, one thing you can do to reduce the WAL I/O volume in Postgres is to increase the inter-checkpoint interval (there are two settings to increase, one time-based and one volume-based). The first write of a given data page after a checkpoint dumps the whole page into WAL, as a safety measure to deal with partial page writes during power failures. So right after a checkpoint the WAL volume goes way up. With a longer interval between checkpoints you don't pay that price as often.

			regards, tom lane
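For anyone wanting to experiment along these lines, the relevant postgresql.conf knobs look roughly like this. The values shown are illustrative only, and the set of usable wal_sync_method choices depends on the platform:

# postgresql.conf -- illustrative values, not recommendations
wal_sync_method = fdatasync    # other possibilities: fsync, open_sync, open_datasync
checkpoint_segments = 32       # volume-based checkpoint trigger (16MB WAL segments)
checkpoint_timeout = 600       # time-based checkpoint trigger, in seconds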
Rod Taylor wrote:
>> If a transaction will do large updates or inserts, why don't we just log
>> the parsed statements in the WAL instead of the individual data blocks
>
> UPDATE table SET col = random();

I can see that and considered it.

The seed state would need to be saved, or any particular command that is not reproducible would need to be exempted from this sort of logging.

Again, this would apply only to situations where a small SQL command created huge changes.

Marty
On Thu, 2004-03-11 at 21:04, Marty Scholes wrote:
> I can see that and considered it.
>
> The seed state would need to be saved, or any particular command that is
> not reproducible would need to be exempted from this sort of logging.
>
> Again, this would apply only to situations where a small SQL command
> created huge changes.

I would say that would exempt a majority of the SQL queries in most designed systems (anything involving a timestamp, for example). Not to mention the fact that the statement itself may use very expensive functions -- perhaps even user functions that don't repeatably do the same thing or that depend on data from other tables.

Consider a successful write to table X for transaction 2, but an unsuccessful write to table Y for transaction 1. Transaction 1 calls a function that uses information from table X -- but it'll get different information this time around.

Anyway, it really doesn't matter. You're trying to save a large amount of time that simply isn't spent in this area in PostgreSQL. fsync() happens once with commit -- and on a busy system, a single fsync call may be sufficient for a number of parallel backends.
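To illustrate the point, here are a few sketches of statements that could not simply be re-executed at recovery time, because each execution can legitimately produce different results. Table, column, and function names are invented for illustration; my_user_func is hypothetical:

UPDATE orders   SET processed_at = now();        -- wall-clock time differs on replay
UPDATE sessions SET token = random();            -- fresh random values every run
INSERT INTO audit_log
SELECT * FROM staging;                           -- result depends on whatever staging
                                                 -- happens to contain at that moment
UPDATE accounts SET balance = my_user_func(id);  -- hypothetical user function that may
                                                 -- read other tables or outside state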
Sailesh Krishnamurthy wrote:
> A major reason for this is that logical logs make recovery contingent
> on being able to execute the "parsed statements". This execution
> might, however, not be possible if the system is itself not in a
> consistent state .. as is normally the case during recovery.

I am not sure I follow you here. The logs should (IMHO) save both types of data: physical pages (what happens now), or the SQL statement if it is small and generates a bunch of changes.

If the DB state cannot be put back to a consistent state prior to a SQL statement in the log, then NO amount of logging will help. The idea is that the state can be put back to what it was prior to a particular log entry, be it raw datafile blocks or a SQL statement.

> What if, for instance, it's the catalog tables that were hosed when
> the system went down? It may be difficult to execute the parsed
> statements without the catalogs.

See above. If this cannot be resolved prior to re-executing a statement in the log, then the problem is beyond ANY subsequent logging.

> Having said that, page-oriented undo logging can be a pain when B-tree
> pages split. For higher concurrency, ARIES uses logical undo
> logging. In this case, the logs are akin to your "parsed statement"
> idea.

Yes, my experience exactly. Maybe we are the only company on the planet that experiences this sort of thing. Maybe not.

> In any case, the only place that parsed statements are useful, imo are
> with searched updates that cause a large number of records to change
> and with "insert into from select" statements.

Yes. Correlated UPDATE, INSERT INTO with subselects and mass DELETE on heavily indexed tables. Index creation... The list goes on and on. I have experienced it and live with it on a daily basis with Oracle. And I despise it. The difference is, of course, that I can't even have this kind of discussion with Oracle, but I can here. ;-)

> Then, there is also the case that this, the "parsed statements"
> approach, is not a general solution. How would you handle the "update
> current of cursor" scenarios? In this case, there is some application
> logic that determines the precise records that change and how they
> change.
>
> Ergo, it is my claim that while logical redo logging does have some
> benefits, it is not a viable general solution.

Agreed, this is not a general solution. What it is, however, is a tremendous improvement over the current situation for transactions that do massive changes to heavily indexed datasets.

I am working on an application right now that will require current postal information on EVERY address in the U.S. -- street name, street address, directional, subunit, 5-digit zip, 3-digit zip, city, state, delivery point barcode, carrier route, latitude, longitude, etc. Most of these fields will need to be indexed, because they will be searched in real time via a web application several thousand times per day.

To keep the addresses current, we will be updating them all (150+ million) on a programmed basis, so we will go through and update several million addresses EVERY DAY, while needing to ensure that the address updates happen atomically so that they don't disrupt web activity.

Maybe this is not a "traditional" RDBMS app, but I am not in the mood to write my own storage infrastructure for it. Then again, maybe I don't know what I am talking about...

Marty
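For concreteness, the statement shapes described above (correlated UPDATE, INSERT INTO with a subselect, mass DELETE against a heavily indexed table) might look something like the following. All names and predicates are invented for illustration:

-- Correlated UPDATE against a heavily indexed table
UPDATE addresses
   SET carrier_route = s.carrier_route,
       dpbc          = s.dpbc
  FROM address_staging s
 WHERE addresses.address_id = s.address_id;

-- INSERT INTO ... with a subselect
INSERT INTO addresses_history
SELECT * FROM addresses
 WHERE address_id IN (SELECT address_id FROM address_staging);

-- Mass DELETE on the same heavily indexed data
DELETE FROM addresses_history
 WHERE archived_on < '2003-01-01';

Each of these touches the heap plus every one of the many indexes on the affected rows, which is where the block-level WAL volume comes from.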
Tom Lane wrote:
> If your load is primarily big update statements, maybe so...

It is. Maybe we are anomalous here.

> I don't think I buy that claim. We don't normally fsync the log file
> except at transaction commit (and read-only transactions don't generate
> any commit record, so they don't cause an fsync). If a single
> transaction is generating lots of log data, it doesn't have to wait for
> that data to hit disk before it can do more stuff.

I have glanced at the code, and I agree that reads do not generate fsync() calls. Since I watched my mirrored RAID 5 arrays hit 2,000 iops with an average request of 4 KB on a recent batch update with Pg, I still think that Pg may be fsync()-ing a bit too often. Since I haven't digested all of the code, I am speaking a bit out of turn.

> But having said that --- on some platforms our default WAL sync method
> is open_datasync, which could result in the sort of behavior you are
> talking about. Try experimenting with the other possible values of
> wal_sync_method to see if you like them better.

I will have to check that. I am running Sparc Solaris 8.

> That probably gets you into a situation where no I/O is really happening
> at all, it's just being absorbed by kernel disk buffers.

Few things would please me more.

> Unfortunately
> that doesn't have a lot to do with the performance you can get if you
> want to be sure you don't lose data ...

I am not sure these are as mutually exclusive as it looks here.

> BTW, one thing you can do to reduce the WAL I/O volume in Postgres is
> to increase the inter-checkpoint interval (there are two settings to
> increase, one time-based and one volume-based). The first write of a
> given data page after a checkpoint dumps the whole page into WAL, as a
> safety measure to deal with partial page writes during power failures.
> So right after a checkpoint the WAL volume goes way up. With a longer
> interval between checkpoints you don't pay that price as often.

I did that and it helped tremendously. Without proper tuning, I just made the numbers pretty large:

shared_buffers = 100000
sort_mem = 131072
vacuum_mem = 65536
wal_buffers = 8192
checkpoint_segments = 32

Thanks for your feedback.

Sincerely,
Marty
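As a quick way to check which sync method and checkpoint interval an installation is actually using (the default varies by platform, as noted above), one can ask the server directly:

SHOW wal_sync_method;
SHOW checkpoint_segments;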
Rod Taylor wrote:
> Anyway, it really doesn't matter. You're trying to save a large amount
> of time that simply isn't spent in this area in PostgreSQL. fsync()
> happens once with commit -- and on a busy system, a single fsync call
> may be sufficient for a number of parallel backends.

I think you may be right. I suspect that most "busy" installations run a large number of "light" update/delete/insert statements. In this scenario, the kind of logging I am talking about would make things worse, much worse.

Marty
(Just a note: my comments are not pg-specific .. indeed I don't know much about pg recovery.)

>>>>> "Marty" == Marty Scholes <marty@outputservices.com> writes:

    Marty> If the DB state cannot be put back to a consistent state
    Marty> prior to a SQL statement in the log, then NO amount of
    Marty> logging will help. The idea is that the state can be put
    Marty> back to what it was prior to a particular log entry, be it
    Marty> raw datafile blocks or a SQL statement.

The point is that with redo logging, you can just blindly apply the log to the data pages in question, without even really restarting the database.

Note that in ARIES, recovery follows: (1) analysis, (2) redo _everything_ since the last checkpoint, (3) undo losers.

So logging carefully will indeed help get the system to a consistent state - actually, after phase (2) above the system will be in precisely the state it was in at the crash .. and all that's left to do is undo all the live transactions (losers).

BTW, logging raw datafile blocks would be pretty gross (physical logging), and so ARIES logs the changes to each tuple in "logical" fashion .. so if only one column changes, only that value (before and after images) is logged. This is what's called "physiological logging".

    Marty> See above. If this cannot be resolved prior to
    Marty> re-executing a statement in the log, then the problem is
    Marty> beyond ANY subsequent logging.

Not true! By applying the log entries carefully you should be able to restore the system to a consistent state.

    >> Having said that, page-oriented undo logging can be a pain when
    >> B-tree pages split. For higher concurrency, ARIES uses logical
    >> undo logging. In this case, the logs are akin to your "parsed
    >> statement" idea.

    Marty> Yes, my experience exactly. Maybe we are the only company
    Marty> on the planet that experiences this sort of thing. Maybe
    Marty> not.

Well, logical undo is still at a much lower level than parsed statements. Each logical undo log is something like "delete key 5 from index xyz".

    Marty> Maybe this is not a "traditional" RDBMS app, but I am not
    Marty> in the mood to write my own storage infrastructure for it.

I agree that your app has a lot of updates .. it's just that I'm not convinced that logical logging is a clean solution.

I also don't have a solution for your problem :-)

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Sailesh Krishnamurthy wrote:
> The point is that with redo logging, you can just blindly apply the
> log to the data pages in question, without even really restarting the
> database.

I also am not a recovery expert, but I have watched it happen more than once.

You bring up a good point. My (perhaps false) understanding of recovery/redo was that the last checkpointed state was recreated, then log changes that finished with a commit were applied and the rest discarded.

Actually, this approach would not be ideal, since the last checkpointed state would be unavailable if any data file writes took place between the checkpoint and the crash. Further, post-checkpoint log entries would surely contain multiple copies of the same data block, and there is no point in applying any but the last log entry for a particular block.

Ok. To recap:

* Logging parsed statements doesn't apply to light updates, and most installations live mostly on light updates.

* Logging parsed statements would not work if the checkpoint state could not be recreated, which is likely the case.

So, this solution won't work, and even if it did, it would not apply to the vast majority of users.

Hmmm... No wonder it hasn't been implemented yet. ;-)

Thanks again,
Marty
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> (Just a note: my comments are not pg-specific .. indeed I don't know
> much about pg recovery).
> ...
> BTW, logging raw datafile blocks would be pretty gross (physical
> logging) and so ARIES logs the changes to each tuple in "logical"
> fashion .. so if only one column changes, only that value (before and
> after images) is logged. This is what's called "physiological
> logging".

What PG currently uses is sort of a hybrid approach. The basic WAL entry normally gives tuple-level detail: a tuple image and location for an INSERT, a new tuple image and old and new locations for UPDATE, a tuple location for DELETE, etc. This might be a bit bulkier than necessary for UPDATEs that change few columns, but it's consistent and easy to replay.

However, as I mentioned to Marty, the very first change to any disk page after a checkpoint will dump an entire (post-change) image of that page into the log, in place of the tuple image or other detail that would allow us to redo the change to that page. The purpose of this is to ensure that if a page is only partially written during a power failure, we have the ability to recreate the entire correct page contents from WAL by replaying everything since the last checkpoint. Replay is thus a write-only operation as far as the data files are concerned --- we only write full pages or modify previously written pages.

(You may be thinking "well, what about a partial write to WAL" --- but if that happens, that's where we stop replaying WAL. The checksums on WAL entries allow us to detect the invalid data before we use it. So we will still have a consistent valid state on disk, showing the effects of everything up through the last undamaged WAL record.)

BTW this is a one-directional log. We do not care about UNDO, only replay. There is nothing that need be undone from a failed transaction; it can only have littered the database with dead tuples, which will eventually be reclaimed by VACUUM.

> Having said that, page-oriented undo logging can be a pain when
> B-tree pages split.

We don't bother to undo index page splits either ...

			regards, tom lane
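A minimal sketch of the "no UNDO needed" point: an aborted transaction simply leaves dead tuple versions behind, and a later VACUUM reclaims them. The table name is hypothetical:

BEGIN;
UPDATE addresses SET city = 'NOWHERE';  -- writes new tuple versions (and WAL for them)
ROLLBACK;                               -- nothing is undone in the data files;
                                        -- the new versions are simply never visible
VACUUM addresses;                       -- the dead versions, and their index entries,
                                        -- are reclaimed here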
sailesh@EECS.Berkeley.EDU writes:
> - Re uni-directional logs
> Of course. I forgot about PG's non-in-place update mechanisms and the
> use of VACUUM .. with versioning there is really no undo logging
> necessary. I guess that means that during VACUUM you might have to do
> significant work in indexes? I'm assuming that you never merge index
> pages.

Yes, VACUUM has to delete dead index entries as well as dead heap tuples, and there are some fine points about making sure that happens in a safe order. I believe the current state of index space recovery is:

* btree: recycles emptied index pages via a freelist; can return empty pages to the OS if they're at the end of the index file, but will not move pages around to make it possible to return more empty pages. (This is all new behavior as of 7.4; before that, we didn't do anything about reclaiming dead space in btrees.) Does not try to merge partially-full pages (this is a possible future improvement, but there are concurrency problems that would need to be solved).

* hash: recycles empty pages via a freelist, never returns them to the OS short of a REINDEX. I think it does merge partially-empty pages within each bucket chain. No provision for reducing the number of buckets if the index population shrinks (again, short of REINDEX).

* rtree, gist: no page recycling that I know of, but I've not looked carefully.

			regards, tom lane
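For reference, a short sketch of the maintenance commands discussed above; the table and index names are hypothetical:

VACUUM ANALYZE addresses;            -- removes dead heap tuples and dead index entries
REINDEX INDEX addresses_zip5_idx;    -- rebuilds the index from scratch, which (per the
                                     -- notes above) is how index space is fully
                                     -- reclaimed short of dropping and recreating it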