Thread: Reducing relation locking overhead

Reducing relation locking overhead

From
Tom Lane
Date:
In looking at the current pgbench results, I notice that one
considerable contribution to LWLock contention is access to the
heavyweight-lock manager.  Almost all of that comes from taking
relation-level locks, so we could cut down the contention if we could
reduce the number of distinct locks taken.  (The lmgr code already
avoids touching shared memory again if, say, you request another
AccessShareLock on a relation you've already AccessShareLocked in
the current transaction.)

I see several places where we might possibly make this kind of
savings:

1. In an UPDATE or DELETE query, we take RowExclusiveLock on the target
relation, and also take AccessShareLock in the scan node that reads the
relation.  It wouldn't take much extra code to make the scan nodes avoid
taking AccessShareLock if the relation is already opened as the query
result relation.  AFAICS there is no downside to this; any lock that
would conflict with AccessShareLock will also conflict with
RowExclusiveLock, so there's no real need to hold both lock types.
(Moreover, we already make a similar kind of optimization when accessing
system catalogs: those routines only take one lock, not two.)  A quick
mechanical check of the conflict-matrix claim is sketched at the end of
this message.
Does anyone have an objection to it?

2. In the same way, an index that is used to scan the target relation
will be locked in both RowExclusiveLock and AccessShareLock modes
(corresponding to its roles as both source and update target).  We could
avoid taking both lock types, but this would require some restructuring
because the code responsible for locking the indexes doesn't currently
have access to the EState to find out whether an index belongs to a
target relation.  What I'm thinking here is that maybe there's no point
in maintaining the read versus write distinction at all for indexes ---
we could just take AccessShareLock in both cases.  Any thoughts on the
pros and cons there?

3. We could also save some trips to the lock manager if we adopt the
same position for user indexes as we do for user tables, namely that
locks once taken are held till end of transaction.  Any thoughts about
that?

4. The only reason we need to take relation-level locks on indexes
at all is to make the world safe for REINDEX being done concurrently
with read-only accesses to the table (that don't use the index being
reindexed).  If we went back to requiring exclusive lock for reindex we
could forget all about both #2 and #3.  Particularly for updates of
relations with lots of indexes, this could be a pretty significant win.
However we'd definitely be giving up something that was seen as a
feature at one point, so I'm not sold on this idea ... unless someone
can see a way to reduce the overhead without giving up concurrent
REINDEX.
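
For reference, the conflict-matrix claim in point 1 can be checked
mechanically.  The stand-alone sketch below hand-copies the table-level
lock conflict matrix from the documentation (illustration only, not the
authoritative table in the lmgr code) and verifies that every mode
conflicting with AccessShareLock also conflicts with RowExclusiveLock:

    /*
     * Illustration only: encode the documented table-level lock conflict
     * matrix and check that any mode conflicting with AccessShareLock also
     * conflicts with RowExclusiveLock -- the premise for dropping the
     * redundant AccessShareLock on the query's result relation.
     */
    #include <stdio.h>

    enum
    {
        AccessShare = 0, RowShare, RowExclusive, ShareUpdateExclusive,
        Share, ShareRowExclusive, Exclusive, AccessExclusive, NUM_MODES
    };

    #define BIT(m) (1 << (m))

    /* conflictTable[m] = set of modes that conflict with mode m */
    static const int conflictTable[NUM_MODES] = {
        /* AccessShare */          BIT(AccessExclusive),
        /* RowShare */             BIT(Exclusive) | BIT(AccessExclusive),
        /* RowExclusive */         BIT(Share) | BIT(ShareRowExclusive) |
                                   BIT(Exclusive) | BIT(AccessExclusive),
        /* ShareUpdateExclusive */ BIT(ShareUpdateExclusive) | BIT(Share) |
                                   BIT(ShareRowExclusive) | BIT(Exclusive) |
                                   BIT(AccessExclusive),
        /* Share */                BIT(RowExclusive) | BIT(ShareUpdateExclusive) |
                                   BIT(ShareRowExclusive) | BIT(Exclusive) |
                                   BIT(AccessExclusive),
        /* ShareRowExclusive */    BIT(RowExclusive) | BIT(ShareUpdateExclusive) |
                                   BIT(Share) | BIT(ShareRowExclusive) |
                                   BIT(Exclusive) | BIT(AccessExclusive),
        /* Exclusive */            BIT(RowShare) | BIT(RowExclusive) |
                                   BIT(ShareUpdateExclusive) | BIT(Share) |
                                   BIT(ShareRowExclusive) | BIT(Exclusive) |
                                   BIT(AccessExclusive),
        /* AccessExclusive */      BIT(NUM_MODES) - 1
    };

    int main(void)
    {
        int extra = conflictTable[AccessShare] & ~conflictTable[RowExclusive];

        if (extra == 0)
            printf("RowExclusiveLock already blocks everything "
                   "AccessShareLock would block\n");
        else
            printf("AccessShareLock is not redundant (mask 0x%x)\n", extra);
        return 0;
    }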
        regards, tom lane


Re: Reducing relation locking overhead

From
Christopher Kings-Lynne
Date:
> 4. The only reason we need to take relation-level locks on indexes
> at all is to make the world safe for REINDEX being done concurrently
> with read-only accesses to the table (that don't use the index being
> reindexed).  If we went back to requiring exclusive lock for reindex we
> could forget all about both #2 and #3.  Particularly for updates of
> relations with lots of indexes, this could be a pretty significant win.
> However we'd definitely be giving up something that was seen as a
> feature at one point, so I'm not sold on this idea ... unless someone
> can see a way to reduce the overhead without giving up concurrent
> REINDEX.

Surely in the real world REINDEX is run so rarely compared to all those 
other operations it'd be a win...

Chris



Re: Reducing relation locking overhead

From
Stephen Frost
Date:
* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote:
> >4. The only reason we need to take relation-level locks on indexes
> >at all is to make the world safe for REINDEX being done concurrently
> >with read-only accesses to the table (that don't use the index being
> >reindexed).  If we went back to requiring exclusive lock for reindex we
> >could forget all about both #2 and #3.  Particularly for updates of
> >relations with lots of indexes, this could be a pretty significant win.
> >However we'd definitely be giving up something that was seen as a
> >feature at one point, so I'm not sold on this idea ... unless someone
> >can see a way to reduce the overhead without giving up concurrent
> >REINDEX.
>
> Surely in the real world REINDEX is run so rarely compared to all those
> other operations it'd be a win...

Yeah, except that in the real world you don't want to bring everything
to a halt while you do a REINDEX.  For my use cases it'd be fine but I
could see cases where it wouldn't be.  Kinda makes me wish we could give
the user the option at runtime somehow but I'm not sure that could be
done...
Thanks,
    Stephen

Re: Reducing relation locking overhead

From
Greg Stark
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> Surely in the real world REINDEX is run so rarely compared to all those other
> operations it'd be a win...

It's not a question of frequency. We're not talking about something like a 10%
performance loss. You're talking about whether REINDEX is useful at all.
Consider installations where REINDEX will require shutting down business
critical operations for days...

It was a *major* new feature that many people were waiting for when Oracle
finally implemented live CREATE INDEX and REINDEX. The ability to run create
an index without blocking any operations on a table, even updates, was
absolutely critical for 24x7 operation.

-- 
greg



Re: Reducing relation locking overhead

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> It was a *major* new feature that many people were waiting for when Oracle
> finally implemented live CREATE INDEX and REINDEX. The ability to run create
> an index without blocking any operations on a table, even updates, was
> absolutely critical for 24x7 operation.

Well, we're still not in *that* ballpark and I haven't seen any serious
proposals to make us so.  How "absolutely critical" is it really?
Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
actually have at the moment, an "absolutely critical" facility?
        regards, tom lane


Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > It was a *major* new feature that many people were waiting for when Oracle
> > finally implemented live CREATE INDEX and REINDEX. The ability to run create
> > an index without blocking any operations on a table, even updates, was
> > absolutely critical for 24x7 operation.
> 
> Well, we're still not in *that* ballpark and I haven't seen any serious
> proposals to make us so.  How "absolutely critical" is it really?
> Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
> actually have at the moment, an "absolutely critical" facility?

REINDEX isn't run that regularly, so perhaps might warrant special
attention. (I think there are other things we could do to avoid ever
needing to run a REINDEX.) 

CREATE/DROP INDEX is important however, since we may want to try out new
index choices without stopping access altogether. But we do also want
the locking contention to be reduced also....

I know at least one other RDBMS that uses optimistic locking when
creating indexes. It checks the table description, builds the index with
a read lock, then checks the table description again before attempting
to lock the catalog, "create" the index and then complete. There is a
risk of getting a "table restructured error" after the build is nearly
complete. If we did that, then we wouldn't need to lock the indexes
because you wouldn't be able to see an index until it was built. Doing
something similar might allow us to have online CREATEs yet without a
locking overhead. 

24x7 operation is actually fairly common. Maybe not with a strong SLA
for availability, but many websites and embedded apps are out there all
the time. The PostgreSQL claim to fame has concurrency at the top of the
list, so we should assume that in all we do.

Best Regards, Simon Riggs



Re: Reducing relation locking overhead

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > Greg Stark <gsstark@mit.edu> writes:
> > > It was a *major* new feature that many people were waiting for when Oracle
> > > finally implemented live CREATE INDEX and REINDEX. The ability to run create
> > > an index without blocking any operations on a table, even updates, was
> > > absolutely critical for 24x7 operation.
> > 
> > Well, we're still not in *that* ballpark and I haven't seen any serious
> > proposals to make us so.  How "absolutely critical" is it really?
> > Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
> > actually have at the moment, an "absolutely critical" facility?

Alright, I'll grant Tom that "absolutely critical" was a bit of hyperbole.

> I know at least one other RDBMS that uses optimistic locking when
> creating indexes. It checks the table description, builds the index with
> a read lock, then checks the table description again before attempting
> to lock the catalog, "create" the index and then complete. There is a
> risk of getting a "table restructured error" after the build is nearly
> complete. 

I suspect this comes out of a very different storage model from Postgres's.

Postgres would have no trouble building an index of the existing data using
only shared locks. The problem is that any newly inserted (or updated) records
could be missing from such an index.

To do it you would then have to gather up all those newly inserted records.
And of course while you're doing that new records could be inserted. And so
on. There's no guarantee it would ever finish, though I suppose you could
detect the situation if the size of the new batch wasn't converging to 0 and
throw an error.

One optimization would be to have a flag that disabled the use of the FSM,
forcing all inserts to extend the table and allocate new tuples at the end.
This would at least limit the amount the index build would have to scan. The
index build could just do one-by-one insertions for the remaining tuples until
it catches up to the head.

At the end of the index build there's also a problem upgrading locks to put in
place the new index. That would create a deadlock risk. Perhaps that's where
the "table restructured error" comes up in these other databases?

> 24x7 operation is actually fairly common. Maybe not with a strong SLA
> for availability, but many websites and embedded apps are out there all
> the time. The PostgreSQL claim to fame has concurrency at the top of the
> list, so we should assume that in all we do.

Off the top of my head I would put these items on the list of "necessary for
24x7 operation":

. (non-FULL) VACUUM
. Online/PITR backups 
. Partitioned Tables
. online index builds

Of which Postgres has 2.5 out of 4. And most of those have come in just the
last 12 months or so. Doing pretty damned good.

-- 
greg



Re: Reducing relation locking overhead

From
Gregory Maxwell
Date:
On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> I suspect this comes out of a very different storage model from Postgres's.
>
> Postgres would have no trouble building an index of the existing data using
> only shared locks. The problem is that any newly inserted (or updated) records
> could be missing from such an index.
>
> To do it you would then have to gather up all those newly inserted records.
> And of course while you're doing that new records could be inserted. And so
> on. There's no guarantee it would ever finish, though I suppose you could
> detect the situation if the size of the new batch wasn't converging to 0 and
> throw an error.

After you're mostly caught up, change locking behavior to block
further updates while the final catchup happens. This could be driven
by a heuristic that says make up to N attempts to catch up without
blocking, after that just take a lock and finish the job. Presumably
the catchup would be short compared to the rest of the work. (A toy
sketch of such a loop follows below.)

Are there environments which could not tolerate even this minimal hit?
Probably, which leaves the choice of telling them 'don't reindex then'
or providing a knob which would tell it to never block (it would just try
N times and then give up, failing the reindex).
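
To make that concrete, here is a toy, stand-alone simulation of the
retry-then-escalate logic (plain C, nothing backend-specific: the "table"
is just a tuple counter and the 1% concurrent-insert rate is invented
purely for illustration):

    /*
     * Toy simulation of the catch-up heuristic, not PostgreSQL code.
     * Up to N non-blocking passes try to index the tuples that appeared
     * during the previous pass; if that converges we are done, otherwise
     * we either fail (the "never block" knob) or finish under a brief
     * blocking lock.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_NONBLOCKING_PASSES 5

    static long table_tuples   = 1000000;   /* tuples present at start */
    static long indexed_tuples = 0;

    /* pretend writers insert about 1% as many rows as a pass indexes */
    static long concurrent_inserts(long work) { return work / 100; }

    static bool catch_up(bool never_block)
    {
        int pass;

        for (pass = 0; pass < MAX_NONBLOCKING_PASSES; pass++)
        {
            long behind = table_tuples - indexed_tuples;

            if (behind == 0)
            {
                printf("converged after %d non-blocking passes\n", pass);
                return true;
            }
            indexed_tuples += behind;                      /* index what we saw */
            table_tuples   += concurrent_inserts(behind);  /* writers keep going */
        }

        if (never_block)
            return false;               /* the "never block" knob: fail instead */

        /* escalate: block writers briefly so nothing new can appear */
        indexed_tuples = table_tuples;
        printf("did not converge; finished under a blocking lock\n");
        return true;
    }

    int main(void)
    {
        if (!catch_up(false))
            printf("index build abandoned\n");
        return 0;
    }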


Re: Reducing relation locking overhead

From
Alvaro Herrera
Date:
Gregory Maxwell wrote:
> On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> > I suspect this comes out of a very different storage model from Postgres's.
> >
> > Postgres would have no trouble building an index of the existing data using
> > only shared locks. The problem is that any newly inserted (or updated) records
> > could be missing from such an index.
> >
> > To do it you would then have to gather up all those newly inserted records.
> > And of course while you're doing that new records could be inserted. And so
> > on. There's no guarantee it would ever finish, though I suppose you could
> > detect the situation if the size of the new batch wasn't converging to 0 and
> > throw an error.
> 
> After you're mostly caught up, change locking behavior to block
> further updates while the final catchup happens. This could be driven
> by a heuristic that says make up to N attempts to catch up without
> blocking, after that just take a lock and finish the job. Presumably
> the catchup would be short compared to the rest of the work.

The problem is that you need to upgrade the lock at the end of the
operation.  This is very deadlock prone, and likely to abort the whole
operation just when it's going to finish.  Is this a showstopper?  Tom
seems to think it is.  I'm not sure anyone is going to be happy if they
find that their two-day reindex was aborted just when it was going to
finish.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Fri, 2005-12-02 at 19:04 -0300, Alvaro Herrera wrote:
> Gregory Maxwell wrote:
> > On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> > > I suspect this comes out of a very different storage model from Postgres's.
> > >
> > > Postgres would have no trouble building an index of the existing data using
> > > only shared locks. The problem is that any newly inserted (or updated) records
> > > could be missing from such an index.
> > >
> > > To do it you would then have to gather up all those newly inserted records.
> > > And of course while you're doing that new records could be inserted. And so
> > > on. 

CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
easily include rows added after the CREATE INDEX started. When the scan
was exhausted we could mark that last TID and return to it after the
sort/build.

> > > There's no guarantee it would ever finish, though I suppose you could
> > > detect the situation if the size of the new batch wasn't converging to 0 and
> > > throw an error.
> > 
> > After you're mostly caught up, change locking behavior to block
> > further updates while the final catchup happens. This could be driven
> > by a heuristic that says make up to N attempts to catch up without
> > blocking, after that just take a lock and finish the job. Presumably
> > the catchup would be short compared to the rest of the work.
> 
> The problem is that you need to upgrade the lock at the end of the
> operation.  This is very deadlock prone, and likely to abort the whole
> operation just when it's going to finish.  Is this a showstopper?  Tom
> seems to think it is.  I'm not sure anyone is going to be happy if they
> find that their two-day reindex was aborted just when it was going to
> finish.

If that is the only objection against such a seriously useful feature,
then we should look at making some exceptions. (I understand the lock
upgrade issue).

Greg has come up with an exceptional idea here, so can we look deeper?
We already know others have done it.

What types of statement would cause the index build to fail? How else
can we prevent them from executing while the index is being built?

Best Regards, Simon Riggs




Re: Reducing relation locking overhead

From
Jochem van Dieten
Date:
On 12/2/05, Alvaro Herrera wrote:
> Gregory Maxwell wrote:
>>
>> After you're mostly caught up, change locking behavior to block
>> further updates while the final catchup happens. This could be driven
>> by a heuristic that says make up to N attempts to catch up without
>> blocking, after that just take a lock and finish the job. Presumably
>> the catchup would be short compared to the rest of the work.
>
> The problem is that you need to upgrade the lock at the end of the
> operation.  This is very deadlock prone, and likely to abort the whole
> operation just when it's going to finish.  Is this a showstopper?  Tom
> seems to think it is.  I'm not sure anyone is going to be happy if they
> find that their two-day reindex was aborted just when it was going to
> finish.

How about the following scenario for building a new index:
- create an empty index
- flag it as incomplete
- commit it so it becomes visible to new transactions
- new transactions will update the index when inserting / updating
- the planner will not use it for queries because it is flagged as incomplete
- wait until the index is visible to all running transactions
- start a new seqscan and insert all records in the index
- commit
- remove the incomplete flag

Wouldn't this overcome the lock upgrade problem?

Jochem

Re: Reducing relation locking overhead

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
> easily include rows added after the CREATE INDEX started. When the scan
> was exhausted we could mark that last TID and return to it after the
> sort/build.

And do what?  This has nothing to do with the fundamental problem of
never being able to catch up unless you can upgrade your lock to exclude
writes.  What's worse, once you have excluded writes you have to rescan
the entire table to be sure you haven't missed anything.  So in the
scenarios where this whole thing is actually interesting, ie enormous
tables, you're still talking about a fairly long interval with writes
locked out.  Maybe not as long as a complete REINDEX, but long.
        regards, tom lane


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Jochem van Dieten <jochemd@gmail.com> writes:
> How about the following scenario for building a new index:
> - create an empty index
> - flag it as incomplete
> - commit it so it becomes visible to new transactions
> - new transactions will update the index when inserting / updating
> - the planner will not use it for queries because it is flagged as incomplete
> - wait until the index is visible to all running transactions
> - start a new seqscan and insert all records in the index
> - commit
> - remove the incomplete flag
> Wouldn't this overcome the lock upgrade problem?

Doesn't really solve the problem for REINDEX, though.  Presumably, the
reason that you are REINDEXing is that you would like to defragment the
existing index.  Well, that requires collecting all the index entries
and sorting them.  The above method is not going to produce a nicely
sorted index; whatever entries get made on-the-fly during the first
stage are going to determine the index shape.

This same problem applies to the build-lock-catchup paradigm, although
less severely since you can hope that the entries to be added on at the
end are only a small part of the total and will fit in the excess space
that you leave in the index leaf pages.  If you spend too long catching
up, though (as in the multiple-pass ideas that various people were
advocating), you'll end up with an index messy enough that it's
questionable why you bothered.
        regards, tom lane


Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Fri, 2005-12-02 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
> > easily include rows added after the CREATE INDEX started. When the scan
> > was exhausted we could mark that last TID and return to it after the
> > sort/build.
> 
> And do what?  This has nothing to do with the fundamental problem of
> never being able to catch up unless you can upgrade your lock to exclude
> writes.  What's worse, once you have excluded writes you have to rescan
> the entire table to be sure you haven't missed anything.  So in the
> scenarios where this whole thing is actually interesting, ie enormous
> tables, you're still talking about a fairly long interval with writes
> locked out.  Maybe not as long as a complete REINDEX, but long.

Those are technical difficulties that I believe can be solved.

Greg was right: online index builds are very useful (for us).

Best Regards, Simon Riggs



Re: Reducing relation locking overhead

From
Robert Treat
Date:
On Friday 02 December 2005 09:53, Simon Riggs wrote:
> On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > Greg Stark <gsstark@mit.edu> writes:
> > > It was a *major* new feature that many people were waiting for when
> > > Oracle finally implemented live CREATE INDEX and REINDEX. The ability
> > > to run create an index without blocking any operations on a table, even
> > > updates, was absolutely critical for 24x7 operation.
> >
> > Well, we're still not in *that* ballpark and I haven't seen any serious
> > proposals to make us so.  How "absolutely critical" is it really?
> > Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
> > actually have at the moment, an "absolutely critical" facility?
>
> REINDEX isn't run that regularly, so perhaps might warrant special
> attention. (I think there are other things we could do to avoid ever
> needing to run a REINDEX.)
>
> CREATE/DROP INDEX is important however, since we may want to try out new
> index choices without stopping access altogether. But we do also want
> the locking contention to be reduced also....
>

Just thought I'd toss in this random data point... I know I still have at least 
one 7.3 system running where reindexes are a part of the regular routine and 
the ability to query against the table simultaneously is certainly 
approaching "absolutely critical" territory. Hoping to get that system 
upgraded by the end of the month, at which point the frequency of reindex 
will surely decrease, but I'm not sure it's going to go away completely. I 
could probably get by at that point with DROP/CREATE INDEX but it wouldn't be 
my preferred way to do it. 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL


Re: Reducing relation locking overhead

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> What's worse, once you have excluded writes you have to rescan the entire
> table to be sure you haven't missed anything. So in the scenarios where this
> whole thing is actually interesting, ie enormous tables, you're still
> talking about a fairly long interval with writes locked out. Maybe not as
> long as a complete REINDEX, but long.

I was thinking you would set a flag to disable use of the FSM for
inserts/updates while the reindex was running. So you would know where to find
the new tuples, at the end of the table after the last tuple you read.

-- 
greg



Re: Reducing relation locking overhead

From
Greg Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> The problem is that you need to upgrade the lock at the end of the
> operation.  This is very deadlock prone, and likely to abort the whole
> operation just when it's going to finish.  Is this a showstopper?  Tom
> seems to think it is.  I'm not sure anyone is going to be happy if they
> find that their two-day reindex was aborted just when it was going to
> finish.

How likely is this really to be a problem in this particular case? Obviously
if two people try to reindex the same index they'll have a problem, but that's
not really a problem. (Postgres should probably do something to block that up
front rather than wait until the end to fail.)

Other than that case is there any other case the reindex could deadlock with?

I'm a bit hampered thinking about this because I'm not really sure exactly
what locks a reindex needs and what else takes those locks.


-- 
greg



Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Greg Stark wrote:
> 
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > What's worse, once you have excluded writes you have to rescan the entire
> > table to be sure you haven't missed anything. So in the scenarios where this
> > whole thing is actually interesting, ie enormous tables, you're still
> > talking about a fairly long interval with writes locked out. Maybe not as
> > long as a complete REINDEX, but long.
> 
> I was thinking you would set a flag to disable use of the FSM for
> inserts/updates while the reindex was running. So you would know where to find
> the new tuples, at the end of the table after the last tuple you
> read.

If REINDEX works by seqscanning the table then the inclusion of new
tuples would happen for free if you turn off the FSM before beginning
the REINDEX operation -- you're guaranteed to see them last.  But that
only works if REINDEX behaves this way.

Then it's a question of what to do with in-flight updates at the time
the REINDEX hits the end of the table.

Even if REINDEX hits the table in non-sequential order, turning off
the FSM should still work.  REINDEX wouldn't need to acquire any
additional locks until after it has scanned the appended area.  So the
way I (perhaps naively) envision it working is:

1.  Acquire read lock on the table
2.  Turn off FSM
3.  Note the location of the end of the table
4.  Release read lock on the table
5.  Perform REINDEX operation
6.  Read and index the bit of the table starting with the location noted previously.
7.  Note new end of table
8.  Acquire read lock on the table
9.  Scan any entries that have been appended past new end of table.
10. Release read lock on table
11. Turn on FSM


In the above for large relations, the bulk of the REINDEX should
happen without any locks being held by the REINDEX operation.  For
small tables (where the amount of new insert activity can be a large
percentage of the total table size), it would almost certainly be more
efficient to just take a read lock for the whole operation.  So it
might be wise to set up some sort of threshold, and to take a read
lock during the whole operation if the table size is smaller than the
threshold.

The reason the sequence I enumerate above involves taking any locks at
all is to avoid the issues that Tom brought up about having to rescan
the entire table to make sure nothing gets missed, to avoid possible
race conditions between steps 2 and 3, and to allow step 9 to
definitively complete, since otherwise in-flight updates would be
missed.


In the context of the original discussion (reduction of lock
acquisition), REINDEX isn't a common operation even if it is a
critical one, so acquisition of more than the usual number of locks
here shouldn't be a big deal.


-- 
Kevin Brown                          kevin@sysexperts.com


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> What's worse, once you have excluded writes you have to rescan the entire
>> table to be sure you haven't missed anything.

> I was thinking you would set a flag to disable use of the FSM for
> inserts/updates while the reindex was running. So you would know where
> to find the new tuples, at the end of the table after the last tuple
> you read.

There are paths that do not consult the FSM, eg update where the new
tuple fits on the same page as the old, or where the backend has already
been given a target page by the FSM.  And this "set a flag" idea in
itself has enormous costs, because the flag will become a central point
of contention: it *must* be touched by every single tuple insertion.
        regards, tom lane


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Other than that case is there any other case the reindex could deadlock with?

Only SELECT, INSERT, UPDATE, and DELETE.
        regards, tom lane


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> In the above for large relations, the bulk of the REINDEX should
> happen without any locks being held by the REINDEX operation.

As I just pointed out to Greg, the arm-waving notion that you can "turn
off the FSM" requires a great deal of low-level locking that is not
there now.  Even ignoring that, you *still* have a lock upgrade problem
in this sketch.
        regards, tom lane


Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > In the above for large relations, the bulk of the REINDEX should
> > happen without any locks being held by the REINDEX operation.
> 
> As I just pointed out to Greg, the arm-waving notion that you can "turn
> off the FSM" requires a great deal of low-level locking that is not
> there now.  

Yeah, I thought that the check for use of the FSM was already being
done by the lower level operators, and that contention would only
occur on the modification of the FSM "enabled" flag.  Obviously this
doesn't work at all if the FSM is just assumed to be in use at all
times, or if the FSM values are read only at transaction start or
something...


> Even ignoring that, you *still* have a lock upgrade problem
> in this sketch.

Hmm, well, I can see a deadlock potential for those operations that
have to acquire multiple locks simultaneously, and I suppose that the
table + FSM lock would qualify in the sequence here, but the rest of
it involves just a single read lock against the table.  What am I
missing?


-- 
Kevin Brown                          kevin@sysexperts.com


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> Even ignoring that, you *still* have a lock upgrade problem
>> in this sketch.

> Hmm, well, I can see a deadlock potential for those operations that
> have to acquire multiple locks simultaneously, and I suppose that the
> table + FSM lock would qualify in the sequence here, but the rest of
> it involves just a single read lock against the table.  What am I
> missing?

At some point you have to lock out writers, else you can never be
certain you have all the tuples.  I was taking your "read lock" to
actually mean "lock out writers"; otherwise the sketch doesn't work
at all.

The real situation is that you must hold at least AccessShareLock on the
table throughout the entire operation, else you have no defense against
(say) someone dropping the index or the entire table out from under you.
And when you add onto this lock in order to lock out writers
temporarily, you are engaging in a lock upgrade, which can deadlock
against any sort of exclusive lock request.  The fact that you've been
holding the AccessShareLock for quite a long time means that the window
for deadlock problems is very wide.
        regards, tom lane


Re: Reducing relation locking overhead

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Other than that case is there any other case the reindex could deadlock with?
> 
> Only SELECT, INSERT, UPDATE, and DELETE.

How does that happen? What exclusive locks do these take that reindex would
conflict with? I guess I'm missing some basic piece of the puzzle here.

I've also started wondering if the other approach proposed doesn't have more
merit than it was granted, namely the plan to create an "incomplete index"
that other transactions update appropriately as they make changes and then
insert the records one by one.

The main objection given was that the newly created index wouldn't be as
cleanly structured as a real reindex. But it seems like that's something that
can be overcome. There's plenty of research on how to build trees that
withstand sequential inserts for example. In the worst case Postgres could
sort and build the entire index temporarily to decide where the page splits
ought to be and then arrange to insert the new records in the real index such
that the page splits end up in those places.

The unstated objection that I think is more severe here is that performance
would suck. Repeatedly inserting individual records would be much slower than
a real reindex. And worse, it would slow down regular operation since every
update and insert would need to content with this batch job to get their
chance to insert their entries.

But I'm not sure performance sucking is really that big of a problem here.
There's always the option of an offline REINDEX if you can withstand the
downtime. To do an online reindex you'll just need a faster system. Nobody
ever said 24x7 operation wouldn't have costs.

-- 
greg



Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Sat, 2005-12-03 at 08:47 -0500, Robert Treat wrote:
> On Friday 02 December 2005 09:53, Simon Riggs wrote:
> > On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > > Greg Stark <gsstark@mit.edu> writes:
> > > > It was a *major* new feature that many people were waiting for when
> > > > Oracle finally implemented live CREATE INDEX and REINDEX. The ability
> > > > to run create an index without blocking any operations on a table, even
> > > > updates, was absolutely critical for 24x7 operation.
> > >
> > > Well, we're still not in *that* ballpark and I haven't seen any serious
> > > proposals to make us so.  How "absolutely critical" is it really?
> > > Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
> > > actually have at the moment, an "absolutely critical" facility?
> >
> > REINDEX isn't run that regularly, so perhaps might warrant special
> > attention. (I think there are other things we could do to avoid ever
> > needing to run a REINDEX.)
> >
> > CREATE/DROP INDEX is important however, since we may want to try out new
> > index choices without stopping access altogether. But we do also want
> > the locking contention to be reduced also....
> >
> 
> Just thought I'd toss in this random data point... I know I still have at least 
> one 7.3 system running where reindexes are a part of the regular routine and 
> the ability to query against the table simultaneously is certainly 
> approaching "absolutely critical" territory. Hoping to get that system 
> upgraded by the end of the month, at which point the frequency of reindex 
> will surely decrease, but I'm not sure it's going to go away completely. I 
> could probably get by at that point with DROP/CREATE INDEX but it wouldn't be 
> my preferred way to do it. 

Understood. At 7.3, REINDEXing is essential, since rows never got
deleted and space was not reused. That is not the case now, hence a
REINDEX is less often required.

Best Regards, Simon Riggs



Re: Reducing relation locking overhead

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> Understood. At 7.3, REINDEXing is essential, since rows never got
> deleted and space was not reused. That is not the case now, hence a
> REINDEX is less often required.

But it's still required or at least desirable under some circumstances.
If it could be improved, it would be good.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Reducing relation locking overhead

From
Jochem van Dieten
Date:
On 12/3/05, Tom Lane wrote:
> Jochem van Dieten writes:
>> How about the following scenario for building a new index:
>> - create an empty index
>> - flag it as incomplete
>> - commit it so it becomes visible to new transactions
>> - new transactions will update the index when inserting / updating
>> - the planner will not use it for queries because it is flagged as incomplete
>> - wait until the index is visible to all running transactions
>> - start a new seqscan and insert all records in the index
>> - commit
>> - remove the incomplete flag
>> Wouldn't this overcome the lock upgrade problem?
>
> Doesn't really solve the problem for REINDEX, though.  Presumably, the
> reason that you are REINDEXing is that you would like to defragment the
> existing index.  Well, that requires collecting all the index entries
> and sorting them.  The above method is not going to produce a nicely
> sorted index; whatever entries get made on-the-fly during the first
> stage are going to determine the index shape.

So how about merging this approach with the catch-up approach?

- take an access share lock so the index doesn't change
- build the index in a non-locking approach
- flag the index as incomplete
- commit it
- from now on new transaction will register their changes in the index
- wait until it is visible to all
- do a table scan with special xmin-xmax comparison
- commit the missing tuples
- remove the incomplete flag

Jochem

Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Sat, 2005-12-03 at 17:16 -0300, Alvaro Herrera wrote:
> Simon Riggs wrote:
> 
> > Understood. At 7.3, REINDEXing is essential, since rows never got
> > deleted and space was not reused. That is not the case now, hence a
> > REINDEX is less often required.
> 
> But it's still required or at least desirable under some circumstances.
> If it could be improved, it would be good.

100% Agreed. I just think having an online CREATE INDEX is a much more
important thing, FWIW.

Best Regards, Simon Riggs



Re: Reducing relation locking overhead

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> The real situation is that you must hold at least AccessShareLock on the
> table throughout the entire operation, else you have no defense against
> (say) someone dropping the index or the entire table out from under you.
> And when you add onto this lock in order to lock out writers
> temporarily, you are engaging in a lock upgrade, which can deadlock
> against any sort of exclusive lock request.  The fact that you've been
> holding the AccessShareLock for quite a long time means that the window
> for deadlock problems is very wide.
>

Maybe the deadlock problem is solvable; our current deadlock removal 
mechanism is like this:

    /* whoever wakes up first removes himself -- quite unfair :-( */
    RemoveFromWaitQueue(MyProc);

What if we change to cost-based removal, i.e., remove the one whose cost is 
smaller? In this case, a two-days-to-be-done reindex should never get 
killed.
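
To make "cost-based removal" a bit more concrete, here is a stand-alone
toy (not the real deadlock checker, and the cost figures are invented):
among the waiters in a deadlock cycle, pick the one that has invested the
least work so far as the victim:

    /*
     * Illustration only: choose the cheapest transaction in a deadlock
     * cycle as the victim, so a long-running REINDEX is never the one
     * that gets killed.
     */
    #include <stdio.h>

    typedef struct
    {
        const char *name;
        double      cost;        /* e.g. work invested so far */
    } Waiter;

    static const Waiter *pick_victim(const Waiter *w, int n)
    {
        const Waiter *victim = &w[0];
        int i;

        for (i = 1; i < n; i++)
            if (w[i].cost < victim->cost)
                victim = &w[i];
        return victim;
    }

    int main(void)
    {
        Waiter cycle[] = {
            { "two-day REINDEX", 9.0e9 },
            { "small UPDATE",    1.2e3 },
            { "nightly report",  4.5e6 },
        };

        printf("abort: %s\n", pick_victim(cycle, 3)->name);
        return 0;
    }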

Regards,
Qingqing






Re: Reducing relation locking overhead

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> What if we change to cost-based removal, i.e., remove the one whose cost is 
> smaller? In this case, a two-days-to-be-done reindex should never get 
> killed.

Instead, it would kill all useful work in your system :-(.  An old
transaction would be the one invincible gunslinger in town, mowing
down every transaction that dared to go up against it, and getting
older and even less vincible with every win.  Yet this would still
not give it any guarantee of ever being able to commit, as long as
there's a constant stream of competition.

Certainly it's interesting to think about a smarter deadlock-escape
algorithm, but that's no solution compared to not getting into a
deadlock in the first place.
        regards, tom lane


Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Tom Lane wrote:
> >> Even ignoring that, you *still* have a lock upgrade problem
> >> in this sketch.
> 
> > Hmm, well, I can see a deadlock potential for those operations that
> > have to acquire multiple locks simultaneously, and I suppose that the
> > table + FSM lock would qualify in the sequence here, but the rest of
> > it involves just a single read lock against the table.  What am I
> > missing?
> 
> At some point you have to lock out writers, else you can never be
> certain you have all the tuples.  I was taking your "read lock" to
> actually mean "lock out writers"; otherwise the sketch doesn't work
> at all.

Right, but the idea is to lock out writers for as brief a time as
possible.  That not only minimizes the possibility of lock contention
but guarantees that REINDEX will get a complete view of the database.

That said, it hinges on some sort of efficient way of identifying the
new tuples created by other transactions that are/were running during
the bulk of the time REINDEX was running.  If there's no good way to do
that, then there's no way to avoid blocking writers for an extended
period of time.

> The real situation is that you must hold at least AccessShareLock on the
> table throughout the entire operation, else you have no defense against
> (say) someone dropping the index or the entire table out from under you.
> And when you add onto this lock in order to lock out writers
> temporarily, you are engaging in a lock upgrade, which can deadlock
> against any sort of exclusive lock request.  

But won't that depend on the order in which the lock requests appear?
If locks A, B, and C are taken in that same order by every
transaction, then there's no possibility of deadlock, right?

> The fact that you've been holding the AccessShareLock for quite a
> long time means that the window for deadlock problems is very wide.

But with respect to deadlocks, that's true only if deadlocks are
possible, which is true only if the order of lock acquisition differs
between transactions.

I guess the real question here is: is it possible to, in code,
guarantee the order of lock acquisition by any given transaction?  For
REINDEX, the problem is simplified somewhat because it's operating
against a single index and a single table, so the locks it has to
acquire are somewhat limited in scope compared with a generic
transaction.

An endeavor to acquire all locks in the same order throughout the code
would not only take care of this REINDEX deadlock problem but would
essentially eliminate all possible deadlocks arising from code-ordered
lock acquisition in the system, which I expect would be considered a
very good thing indeed.  But I expect it would be a lot of effort and
wouldn't be worth it just to make REINDEX behave differently than it
does now.

So what am I missing/overlooking here?


-- 
Kevin Brown                          kevin@sysexperts.com


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> I guess the real question here is: is it possible to, in code,
> guarantee the order of lock acquisition by any given transaction?

Yes, but not in our code :-(.  This is largely determined by what the
application does.
        regards, tom lane


Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I guess the real question here is: is it possible to, in code,
> > guarantee the order of lock acquisition by any given transaction?
> 
> Yes, but not in our code :-(.  This is largely determined by what the
> application does.

Yeah, that's what I figured.

But what of the more limited problem of lock acquisition relative to
the locks that REINDEX might need to acquire?  Since those locks are
limited to a single table and a single index, I'd think the problem
wouldn't be insurmountable.  No?

Suppose the following rules were enforced in the code:

- when requesting a type of lock, one must first acquire all lesser
  lock types against the object in order of strength.  Hence, one must
  acquire AccessShareLock before acquiring AccessExclusiveLock.

- one must always acquire a given lock type against the table before acquiring it against the index.

The latter rule might be sufficient, if the former rule is already
implied by the lock types and how they're acquired.

Thus, acquisition of AccessExclusiveLock against the index
automatically implies acquisition of AccessShareLock(table), then
AccessShareLock(index), then AccessExclusiveLock(table), then
AccessExclusiveLock(index).

I could see that possibly causing performance problems (and would be
interested in knowing what performance problems it would likely
cause), but given the limited scope of the locks that REINDEX needs,
wouldn't the above be sufficient to keep REINDEX deadlock-free against
other transactions?



-- 
Kevin Brown                          kevin@sysexperts.com


Re: Reducing relation locking overhead

From
Greg Stark
Date:
Kevin Brown <kevin@sysexperts.com> writes:

> > The fact that you've been holding the AccessShareLock for quite a
> > long time means that the window for deadlock problems is very wide.
> 
> But with respect to deadlocks, that's true only if deadlocks are
> possible, which is true only if the order of lock acquisition differs
> between transactions.

Taking locks in different orders is one way to trigger deadlocks. But it's not
the only way.

If two threads both take a shared lock on a resource, then try to upgrade it
to an exclusive lock they'll both be waiting on the other thread to release
its shared lock... And in this case they're acquiring the locks in precisely
the same order.
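
A self-contained illustration of that point, using plain pthreads rather
than anything in the backend (build with cc -pthread): both threads take
the shared lock in exactly the same order, then each tries to upgrade and
waits forever for the other's shared hold to go away.

    /*
     * Two threads each take a shared lock, then try to upgrade it to
     * exclusive.  Upgrading means waiting until you are the only holder
     * left, so each thread waits on the other and neither ever proceeds.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int readers = 0;          /* current shared holders */
    static int stuck   = 0;          /* threads blocked in an upgrade */

    static void acquire_shared(void)
    {
        pthread_mutex_lock(&m);
        readers++;
        pthread_mutex_unlock(&m);
    }

    static void upgrade_to_exclusive(void)
    {
        pthread_mutex_lock(&m);
        stuck++;
        while (readers > 1)          /* wait to be the only holder left */
            pthread_cond_wait(&c, &m);
        stuck--;
        pthread_mutex_unlock(&m);
    }

    static void *worker(void *arg)
    {
        acquire_shared();            /* both threads hold the shared lock */
        sleep(1);
        upgrade_to_exclusive();      /* neither thread ever gets past here */
        printf("thread %ld upgraded\n", (long) arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, (void *) 1L);
        pthread_create(&t2, NULL, worker, (void *) 2L);
        sleep(3);

        pthread_mutex_lock(&m);
        if (stuck == 2)
            printf("both threads are blocked in the upgrade: deadlock\n");
        pthread_mutex_unlock(&m);
        return 0;                    /* exiting tears down the stuck threads */
    }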

What I don't get from what Tom said was that he implied REINDEX would deadlock
against a regular thread acquiring a plain old exclusive lock even when it's
not also a lock upgrade. It seems to me that REINDEX should be able to get in
and get its exclusive lock and finish its job and then the other job should be
able to get its exclusive lock and complete fine.

-- 
greg



Re: Reducing relation locking overhead

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> - when requesting a type of lock, one must first acquire all lesser
>   lock types against the object in order of strength.  Hence, one must
>   acquire AccessShareLock before acquiring AccessExclusiveLock.

This is exactly wrong ...
        regards, tom lane


Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > - when requesting a type of lock, one must first acquire all lesser
> >   lock types against the object in order of strength.  Hence, one must
> >   acquire AccessShareLock before acquiring AccessExclusiveLock.
> 
> This is exactly wrong ...

And now I see why, since it will introduce deadlocks (sigh).  But what
of the other rule (always acquiring locks against the table before the
index)?  You may have stopped reading at the above...

One thing I don't quite understand about the discussion is why there's
particular attention being paid to deadlocks with respect to REINDEX
when it clearly can happen in the general case when lock promotion is
involved.  Why is REINDEX special here?

If the problem is that REINDEX has to hold an AccessShareLock to
prevent the table or index from being dropped, but does not need to
prevent writers in general (because the presumption is that there is
some way of efficiently discovering the additional modifications made
during the bulk of REINDEX processing), then doesn't that mean that an
AccessShareLock is the wrong kind of lock for REINDEX to be holding,
and that the appropriate type of lock should be created if it doesn't
already exist?

Additionally, I was under the impression that normal INSERTs, UPDATEs,
and DELETEs didn't generally need to acquire AccessExclusiveLock,
because of MVCC.  If that's the case, then aren't the operations that
could deadlock REINDEX relatively rare?  And if those operations *do*
need to acquire that lock type, then what exactly does MVCC buy you?



-- 
Kevin Brown                          kevin@sysexperts.com


Re: Reducing relation locking overhead

From
mark@mark.mielke.cc
Date:
On Sun, Dec 04, 2005 at 10:40:55PM -0800, Kevin Brown wrote:
> One thing I don't quite understand about the discussion is why there's
> particular attention being paid to deadlocks with respect to REINDEX
> when it clearly can happen in the general case when lock promotion is
> involved.  Why is REINDEX special here?

Although there is a general case, I don't believe it is a common case.
INSERT creates the rows invisible to other transactions. UPDATE and
DELETE would lock the row for writing. Normally, lock promotion isn't
required.

If, however, a person locks the row for reading using SELECT, and
then attempts to upgrade the lock by SELECT, UPDATE, or DELETE,
this would result in a lock promotion.

> If the problem is that REINDEX has to hold an AccessShareLock to
> prevent the table or index from being dropped, but does not need to
> prevent writers in general (because the presumption is that there is
> some way of efficiently discovering the additional modifications made
> during the bulk of REINDEX processing), then doesn't that mean that an
> AccessShareLock is the wrong kind of lock for REINDEX to be holding,
> and that the appropriate type of lock should be created if it doesn't
> already exist?

I think the problem is the 'efficiently discovering the additional
modifications during' REINDEX processing. A few ideas have been
floating around - but none proven. Even the (rather clever in my
opinion) solution that would create the index incomplete, allowing
new updates to be written to the index automatically, while REINDEX
fills in the rest of it in the background before marking it
complete, still has problems. If it is a new unique index, then
for a time, the unique constraint would not be enforced.

I think it has to be tweaked some, such that the new incomplete
index would be created along-side the other index, and once the
indexes match up, remove the original, and rename the new one
into place.

> Additionally, I was under the impression that normal INSERTs, UPDATEs,
> and DELETEs didn't generally need to acquire AccessExclusiveLock,
> because of MVCC.  If that's the case, then aren't the operations that
> could deadlock REINDEX relatively rare?  And if those operations *do*
> need to acquire that lock type, then what exactly does MVCC buy you?

REINDEX needs to see visible and invisible rows. This goes back to the
previous point. Efficiently and reliably discovering newly created rows.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

                          http://mark.mielke.cc/



Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Fri, 2005-12-02 at 02:14, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > It was a *major* new feature that many people were waiting for when Oracle
> > finally implemented live CREATE INDEX and REINDEX. The ability to run create
> > an index without blocking any operations on a table, even updates, was
> > absolutely critical for 24x7 operation.
> 
> Well, we're still not in *that* ballpark and I haven't seen any serious
> proposals to make us so.  How "absolutely critical" is it really?
> Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
> actually have at the moment, an "absolutely critical" facility?

I don't think REINDEX is very critical (except if for some reason you
have failed to vacuum a high-traffic table for some time and need to get
index sizes down).

OTOH, being able to add indexes on a live database is sometimes required,
and neither of the current ways (accept a few hours of downtime or use a
Slony replica and do a switchover) is always acceptable. This capability
is reportedly present in MSSQL and available for Oracle if you get the
more expensive Enterprise Edition.


So, after more thinking, I have come up with a proposal for a fully
concurrent (read+write) CREATE INDEX, which should need a minimal amount
of locking.

Concurrent CREATE INDEX 
========================

Concurrent index NDX1 on table TAB1 is created like this:

1) start transaction. take a snapshot SNAP1

1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer
inserts/updates to happen at end of table (won't work for in-page
updates without code changes) 

2) create the index as we do now, but only for pages which are visible
in SNAP1

3) record the index in pg_class, but mark it as "do not use for lookups"
in a new field. Take snapshot SNAP2. commit transaction.

-- at this point all new inserts and updates will be recorded in NDX1

4) Run a full scan over TAB1 and add all rows that are visible in SNAP2
but not in SNAP1 to NDX1; a toy sketch of this visibility test follows
below. (if there is some way (like p1.1) to restrict or record the area
in heap that new tuples go to, then this can be done more efficiently
than a full scan)

5) record the status of index as "ready for use".

-- now the index is fully created and usable
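
As a toy model of the visibility test in step 4 -- deliberately
oversimplified, with a snapshot reduced to a single cutoff XID and
in-progress transactions ignored -- this shows which rows each phase is
responsible for:

    /*
     * Illustration only, not real visibility code: a "snapshot" is just a
     * cutoff XID, and a tuple is visible if its xmin is at or below it.
     * The catch-up scan of step 4 indexes exactly the rows visible in
     * SNAP2 but not in SNAP1.
     */
    #include <stdio.h>

    typedef unsigned int Xid;

    typedef struct { Xid xmin; } HeapTuple;

    static int visible(const HeapTuple *t, Xid snap_cutoff)
    {
        return t->xmin <= snap_cutoff;
    }

    int main(void)
    {
        Xid snap1 = 100;      /* taken at step 1 */
        Xid snap2 = 130;      /* taken at step 3 */
        HeapTuple heap[] = { {90}, {95}, {105}, {120}, {140} };
        int i, n = sizeof(heap) / sizeof(heap[0]);

        for (i = 0; i < n; i++)
        {
            if (visible(&heap[i], snap1))
                printf("xmin %u: indexed by the initial build\n", heap[i].xmin);
            else if (visible(&heap[i], snap2))
                printf("xmin %u: indexed by the step-4 catch-up scan\n",
                       heap[i].xmin);
            else
                printf("xmin %u: inserted after SNAP2, maintained directly "
                       "by the inserting transaction\n", heap[i].xmin);
        }
        return 0;
    }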


This is in no way a final proposal, but rather a starting point for
discussion of how things might be doable. For example p.3 is probably
tricky to do in a way that all backends pick up at the right time. This
will need most places that do table updates to be reviewed to make sure
that they check for new indexes.

Any comments are appreciated.

-------------
Hannu Krosing







Re: Reducing relation locking overhead

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> And now I see why, since it will introduce deadlocks (sigh).  But what
> of the other rule (always acquiring locks against the table before the
> index)?  You may have stopped reading at the above...

We already do that.

> One thing I don't quite understand about the discussion is why there's
> particular attention being paid to deadlocks with respect to REINDEX
> when it clearly can happen in the general case when lock promotion is
> involved.  Why is REINDEX special here?

Because what people are asking for is an on-line REINDEX, ie, something
that will go through in the presence of concurrent updates.  Most other
sorts of DDL changes to tables take exclusive locks so they don't have
to worry about concurrency.  (There is of course some risk of deadlock
from the exclusive lock, but at least it happens *before* you've done
a lot of work not *after*.)

> Additionally, I was under the impression that normal INSERTs, UPDATEs,
> and DELETEs didn't generally need to acquire AccessExclusiveLock,
> because of MVCC.  If that's the case, then aren't the operations that
> could deadlock REINDEX relatively rare?

The concern about deadlock applies to the various proposals that involve
upgrading to a write-prevention lock at some late point in the process.
That clearly has the potential to deadlock against relatively weak lock
requests.
        regards, tom lane


Re: Reducing relation locking overhead

From
Alvaro Herrera
Date:
I wonder if it would work to release the AccessShareLock before
acquiring the ExclusiveLock.  Of course, this would let any ALTER TABLE
or DROP TABLE to do anything they wanted, but we could check that the
table is still the same after reacquiring the exclusive lock.  REINDEX
would have to abort if anything unexpected happened to the table while
the REINDEX transaction was waiting after releasing the shared lock.

What's not at all clear to me is at what time is the lock "upgraded"?
Just after scanning the heap for the first time?  In this case, how do
we detect all the tuples that need to be inserted after we acquire the
exclusive lock?  Are they listed somewhere?

I imagine each table could have a global flag telling "there is an
online reindex running for this relation".  If this flag is set, each
insert/delete to the index needs to save aside somewhere, the CTIDs of
tuples it is inserting/deleting.  So the protocol for reindex could be:

acquire vacuum lock
  acquire read lock
    set REINDEX flag
    build the bulk of the index  -- this takes a lot of time ...
    -- meanwhile other transactions save CTIDs in a "spill area"
  release read lock

  acquire exclusive lock
    recheck the table, abort if something weird happened
    read the spill area, insert/delete from the index as appropriate
    mark the index as complete
  release exclusive lock

release vacuum lock
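
A toy model of that spill-area bookkeeping (stand-alone C with made-up
types and a fixed-size array standing in for whatever the real spill
storage would be):

    /*
     * Illustration only: while the REINDEX flag is set, writers append the
     * CTIDs of tuples they insert or delete to a side list; the final,
     * exclusive-lock phase replays that list into the new index.
     */
    #include <stdio.h>

    #define MAX_SPILL 1024

    typedef struct { unsigned block; unsigned offset; int deleted; } SpillEntry;

    static int        reindex_flag = 0;
    static SpillEntry spill[MAX_SPILL];
    static int        nspill = 0;

    /* called on every index insert/delete while the flag is set */
    static void maybe_spill(unsigned block, unsigned offset, int deleted)
    {
        if (reindex_flag && nspill < MAX_SPILL)
        {
            spill[nspill].block   = block;
            spill[nspill].offset  = offset;
            spill[nspill].deleted = deleted;
            nspill++;
        }
    }

    int main(void)
    {
        int i;

        reindex_flag = 1;             /* bulk build starts under the read lock */
        maybe_spill(10, 3, 0);        /* concurrent INSERT */
        maybe_spill(11, 7, 0);        /* concurrent INSERT */
        maybe_spill(2, 5, 1);         /* concurrent DELETE */

        /* exclusive-lock phase: replay the spill area into the new index */
        for (i = 0; i < nspill; i++)
            printf("%s (%u,%u)\n",
                   spill[i].deleted ? "remove from new index" : "insert into new index",
                   spill[i].block, spill[i].offset);

        reindex_flag = 0;
        return 0;
    }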

The "vacuum lock" is designed to leave any VACUUM out of the equation,
but let any select/insert/update/delete run.  Maybe this lock could
leave ALTER TABLE and other stuff out too.  Not sure if we have
something like this in our lock table -- if not, can we create it?

Note that by this proposal, any DDL gets more expensive -- but if it's
the normal case (no REINDEX running), it's only a flag check.
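
A minimal Python sketch of the flag-plus-spill-area protocol above (purely
illustrative: the class and function names are invented, and the locks are
only simulated by the ordering of the calls):

class Table:
    def __init__(self, rows):
        self.rows = dict(rows)          # ctid -> key
        self.index = {}                 # key -> ctid (the index being rebuilt)
        self.reindex_flag = False       # "online reindex running" flag
        self.spill = []                 # CTIDs touched while the flag is set

    def insert(self, ctid, key):
        self.rows[ctid] = key
        if self.reindex_flag:
            self.spill.append(ctid)     # writers log their CTIDs for the rebuild
        else:
            self.index[key] = ctid      # normal case: maintain the index directly

def reindex_with_spill(t, concurrent_inserts):
    # -- acquire "vacuum lock", acquire read lock, set REINDEX flag
    t.reindex_flag = True
    snapshot = list(t.rows.items())
    # -- release read lock; writers keep running while we do the bulk build
    for ctid, key in concurrent_inserts:
        t.insert(ctid, key)
    t.index = {key: ctid for ctid, key in snapshot}   # the slow bulk build
    # -- acquire exclusive lock: replay the spill area, mark index complete
    for ctid in t.spill:
        t.index[t.rows[ctid]] = ctid
    t.reindex_flag = False
    t.spill.clear()
    # -- release exclusive lock, release "vacuum lock"

t = Table({(0, 1): 'a', (0, 2): 'b'})
reindex_with_spill(t, concurrent_inserts=[((0, 3), 'c')])
print(sorted(t.index))                  # ['a', 'b', 'c']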

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Reducing relation locking overhead

From
Kevin Brown
Date:
Tom Lane wrote:
> The concern about deadlock applies to the various proposals that involve
> upgrading to a write-prevention lock at some late point in the process.
> That clearly has the potential to deadlock against relatively weak lock
> requests.

After looking at the various lock types, I don't see how this is the
case at all (which may mean that I'm more confused than ever.  But
please read on).  It seems to me that only ops that promote to
AccessExclusiveLock can deadlock against at least some of the proposed
REINDEX implementations.

REINDEX would have to initially grab AccessShareLock, of course, but
AccessExclusiveLock is the only lock type that blocks against it, so
in the case of lock promotion the only operations that would cause
REINDEX to really deadlock (as opposed to simply blocking) are
operations on the entire table (ALTER TABLE, DROP TABLE, etc.).

None of the common operations block against an AccessShareLock, and
the order of acquisition against objects (table vs index) is already
enforced, so where's the deadlock potential?


REINDEX would, I expect, promote its lock to ShareLock when it's time
for it to block writers.  That would block against quite a number of
operations, of course, but that's not a problem in and of itself,
because it need only wait until the operations in question are
finished.  The lock won't be granted until those other operations are
finished, and nothing aside from table-level ops will block against
the REINDEX until that lock is granted.  A true deadlock won't happen
against common operations unless REINDEX promotes its lock again to
something stronger than ShareLock, and that's easy to avoid: just have
REINDEX promote directly from AccessShareLock to the strongest lock it
will ever need.
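
The claim is easy to check against the documented table-level lock conflict
matrix.  The dict below just transcribes that matrix as a Python illustration
(it is not code from the lock manager):

CONFLICTS = {
    "AccessShareLock":          {"AccessExclusiveLock"},
    "RowShareLock":             {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock":         {"ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock":                {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareRowExclusiveLock":    {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ExclusiveLock":            {"RowShareLock", "RowExclusiveLock",
                                 "ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "AccessExclusiveLock":      {"AccessShareLock", "RowShareLock",
                                 "RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
}

def conflicts(a, b):
    return b in CONFLICTS[a]

# Only AccessExclusiveLock (ALTER TABLE, DROP TABLE, ...) conflicts with the
# AccessShareLock the REINDEX starts with:
print([m for m in CONFLICTS if conflicts("AccessShareLock", m)])
# The later ShareLock blocks writers but not readers:
print(conflicts("ShareLock", "RowExclusiveLock"),
      conflicts("ShareLock", "AccessShareLock"))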




-- 
Kevin Brown                          kevin@sysexperts.com


Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Jochem van Dieten
Date:
On 12/5/05, Hannu Krosing wrote:
>
> Concurrent CREATE INDEX
> ========================
>
> Concurrent index NDX1 on table TAB1 is created like this:
>
> 1) start transaction. take a snapshot SNAP1
>
> 1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer
> inserts/updates to happen at end of table (won't work for in-page
> updates without code changes)
>
> 2) create the index as we do now, but only for pages which are visible
> in SNAP1
>
> 3) record the index in pg_class, but mark it as "do not use for lookups"
> in a new field. Take snapshot SNAP2. commit transaction.

What happens if another transaction takes a snapshot between SNAP2 and
the commit? Wouldn't you need a lock to guard against that? (Not that
I don't know if that is possible or desirable.)

Jochem

Jochem van Dieten <jochemd@gmail.com> writes:
> On 12/5/05, Hannu Krosing wrote:
>> 3) record the index in pg_class, but mark it as "do not use for lookups"
>> in a new field. Take snapshot SNAP2. commit transaction.

> What happens if another transaction takes a snapshot between SNAP2 and
> the commit? Wouldn't you need a lock to guard against that? (Not that
> I don't know if that is possible or desirable.)

It's worse than that, because an updating command that is already
running has already made its list of which indexes to update.  You can't
say "commit" and expect transactions already in flight to react
magically to the presence of the new index.  If you take a lock that
excludes writes, and then release that lock with your commit (lock
release actually happens after commit btw), then you can be sure that
subsequent write transactions will see your new index, because they take
their writer's lock before they inspect pg_index to see what indexes
they need to update.

Short of taking such a lock, you have a race condition.

There's another little problem: it's not clear that "present in SNAP2
but not in SNAP1" has anything to do with the condition you need.  This
would exclude rows made by transactions still in progress as of SNAP2,
but you can't know whether such rows were made before or after your
commit of the index.  It doesn't do the right thing for deleted rows
either (deleted rows may still need to be entered into the index),
though perhaps you could fix that with a creative reinterpretation of
what "present in a snap" means.
        regards, tom lane


Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 20:50, Jochem van Dieten wrote:
> On 12/5/05, Hannu Krosing wrote:
> >
> > Concurrent CREATE INDEX
> > ========================
> >
> > Concurrent index NDX1 on table TAB1 is created like this:
> >
> > 1) start transaction. take a snapshot SNAP1
> >
> > 1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer
> > inserts/updates to happen at end of table (won't work for in-page
> > updates without code changes)
> >
> > 2) create the index as we do now, but only for pages which are visible
> > in SNAP1
> >
> > 3) record the index in pg_class, but mark it as "do not use for lookups"
> > in a new field. Take snapshot SNAP2. commit transaction.
> 
> What happens if another transaction takes a snapshot between SNAP2 and
> the commit? 

I'm hoping there is some clever way to circumvent it (or its effects),
but I can't see it yet.

> Wouldn't you need a lock to guard against that? (Not that
> I don't know if that is possible or desirable.)

That may be needed. At least I hope it is possible in a way that can
guarantee avoiding deadlocks.

What I have in mind would be something like this to get both SNAP2 and
commit between any transactions:

LOOP:
    LOCK AGAINST STARTING NEW TRANSACTIONS
    LOOP UP TO N SEC:
        IF NO OTHER TRANSACTIONS: BREAK
        ELSE: CONTINUE
    IF NO OTHER TRANSACTIONS: BREAK
    ELSE:
        UNLOCK AGAINST STARTING NEW TRANSACTIONS
        SLEEP N SEC

TAKE SNAP2
COMMIT (AND UNLOCK)


This will eventually succeed (given the right value for N) and will
guarantee that SNAP2 and COMMIT are atomic wrt other backends.
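
A toy Python version of that loop (pure simulation, nothing
Postgres-specific: block new transactions, wait up to N seconds for the
in-flight ones to drain, and only take the snapshot if things really
went quiet):

import random

def try_quiet_point(active, max_wait, clock, tick=0.01):
    """active: remaining runtimes (in seconds) of in-flight transactions."""
    blocked_at = clock                   # LOCK AGAINST STARTING NEW TRANSACTIONS
    while clock - blocked_at < max_wait:
        active = [t - tick for t in active if t > tick]
        clock += tick
        if not active:                   # IF NO OTHER TRANSACTIONS: BREAK
            return True, clock           # TAKE SNAP2; COMMIT (AND UNLOCK)
    return False, clock                  # UNLOCK; caller sleeps and retries

clock, attempts = 0.0, 0
while True:
    attempts += 1
    # each in-flight OLTP transaction has 5-50 ms left to run
    inflight = [random.uniform(0.005, 0.05) for _ in range(20)]
    ok, clock = try_quiet_point(inflight, max_wait=0.1, clock=clock)
    if ok:
        break
    clock += 0.1                         # SLEEP N SEC, then retry
print("hit a quiet point after", attempts, "attempt(s)")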


--------------
Hannu




Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 15:12, Tom Lane wrote:
> Jochem van Dieten <jochemd@gmail.com> writes:
> > On 12/5/05, Hannu Krosing wrote:
> >> 3) record the index in pg_class, but mark it as "do not use for lookups"
> >> in a new field. Take snapshot SNAP2. commit transaction.
> 
> > What happens if another transaction takes a snapshot between SNAP2 and
> > the commit? Wouldn't you need a lock to guard against that? (Not that
> > I don't know if that is possible or desirable.)
> 
> It's worse than that, because an updating command that is already
> running has already made its list of which indexes to update.  You can't
> say "commit" and expect transactions already in flight to react
> magically to the presence of the new index.  If you take a lock that
> excludes writes, and then release that lock with your commit (lock
> release actually happens after commit btw),

Is it possible to release a lock without commit ?

>  then you can be sure that
> subsequent write transactions will see your new index, because they take
> their writer's lock before they inspect pg_index to see what indexes
> they need to update.
> 
> Short of taking such a lock, you have a race condition.
> 
> There's another little problem: it's not clear that "present in SNAP2
> but not in SNAP1" has anything to do with the condition you need.  This
> would exclude rows made by transactions still in progress as of SNAP2,
> but you can't know whether such rows were made before or after your
> commit of the index.  It doesn't do the right thing for deleted rows
> either (deleted rows may still need to be entered into the index),
> though perhaps you could fix that with a creative reinterpretation of
> what "present in a snap" means.

It seems that taking SNAP1 also needs to be fitted between any other
transactions (i.e. no transaction can be running at the time), which can
hopefully be done as I outlined in my other answer to the grandparent.

-----------------
Hannu




Hannu Krosing <hannu@skype.net> writes:
> What I have in mind would be something like this to get both SNAP2 and
> commit between any transactions:

> LOOP:
>     LOCK AGAINST STARTING NEW TRANSACTIONS

I can hardly credit that "let's block startup of ALL new transactions"
is a more desirable restriction than "let's block writers to the table
we wish to reindex".
        regards, tom lane


Hannu Krosing <hannu@skype.net> writes:
> Is it possible to release a lock without commit ?

Yes, but I don't see where that helps you here.

(To do any of this, you'd need to use the same kluge VACUUM does to hold
selected locks across a series of transactions.  So in reality you'd
probably be thinking of committing a startup transaction and letting
some of the locks be released by that.)
        regards, tom lane


Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 15:38, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > What I have in mind would be something like this to get both SNAP2 and
> > commit between any transactions:
> 
> > LOOP:
> >     LOCK AGAINST STARTING NEW TRANSACTIONS
> 
> I can hardly credit that "let's block startup of ALL new transactions"
> is a more desirable restriction than "let's block writers to the table
> we wish to reindex".

If the block is short-lived (will be removed one way or another in a few
(tenths of) seconds), then this is much more desirable than blocking
writers for a few hours.

The scenario where a concurrent create index command is needed is 24/7
OLTP databases, which can't be taken down for maintenance. Usually they
can be arranged to tolerate postponing a few transactions for one
second.

----------
Hannu




Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 15:41, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > Is it possible to release a lock without commit ?
> 
> Yes, but I don't see where that helps you here.
> 
> (To do any of this, you'd need to use the same kluge VACUUM does to hold
> selected locks across a series of transactions.  So in reality you'd
> probably be thinking of committing a startup transaction and letting
> some of the locks be released by that.)

Hmm, that sounds like a plan:

1) run a transaction repeatedly, trying to hit a point of no concurrent
transactions, enhance the odds by locking out other transactions from
starting for a few (tenths or hundredths of) seconds; if it succeeds,
record SNAP1, commit and continue, else rollback, then sleep a little
and retry.

2) build index on all rows inserted before SNAP1

3) run a transaction repeatedly, trying to hit a point of no concurrent
transactions by locking out other transactions for a few (tenths or
hundredths of) seconds, if it succeeds, record SNAP2, mark index as
visible for inserts, commit. Now all new transactions see the index and
use it when inserting new tuples.

4) scan over table, add all tuples between SNAP1 and SNAP2 to index 

5) mark index as usable for query plans
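
To make the catch-up step concrete, here is a toy Python model of steps 2
and 4, with SNAP1/SNAP2 reduced to plain transaction ids taken at the
quiet points (a "tuple" is just (xmin, key); all names are made up for
the illustration):

def build_index(tuples, snap1_xid, snap2_xid):
    # step 2: bulk-build from everything committed before SNAP1
    index = {key for xmin, key in tuples if xmin < snap1_xid}
    # ... index is now visible for inserts; new transactions maintain it ...
    # step 4: add only what landed between the two snapshots
    catch_up = {key for xmin, key in tuples
                if snap1_xid <= xmin < snap2_xid}
    return index | catch_up

tuples = [(90, 'old'), (105, 'during-build'), (130, 'after-snap2')]
# xid 130 began after SNAP2, so that transaction saw the index and added
# its own entry; the builder only has to cover the first two itself.
print(sorted(build_index(tuples, snap1_xid=100, snap2_xid=120)))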


-------------
Hannu




Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> 1) run a transaction repeatedly, trying to hit a point of no concurrent
> transactions,

In the sort of 24x7 environment that people are arguing this is needed
for, it's entirely possible that that will *never* succeed.
        regards, tom lane


Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 16:01, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > 1) run a transaction repeatedly, trying to hit a point of no concurrent
> > transactions,
> 
> In the sort of 24x7 environment that people are arguing this is needed
> for, it's entirely possible that that will *never* succeed.

My OLTP transactions are usually 5-50ms in length. Common sense tells
me that if I disallow new transactions for 100ms, I am more than likely
to have waited for all existing ones to have finished and can do my 1 ms
of "take snapshot + commit" and let all the waiting transactions
commence.

If the database is running longer transactions, there can be a GUC to
adjust the time CONCURRENT CREATE INDEX will wait. For example, for
transactions mostly in the 0.5-2 sec range the wait could be set to 3 sec.

-------------
Hannu



Hannu Krosing <hannu@skype.net> writes:

> The scenario where a concurrent create index command is needed is 24/7
> OLTP databases, which can't be taken down for maintenance. Usually they
> can be arranged to tolerate postponing a few transactions for one
> second.

Well, the dominant defining characteristic of "OLTP" is precisely that you do
*not* have the timing requirements under your control and can't make such
arrangements. That is, you have to process requests as fast as they come in,
whatever rate that might be.

But that said, realistically *any* solution has to obtain a lock at some time
to make the schema change. I would say pretty much any O(1) (constant time)
outage is at least somewhat acceptable as contrasted with the normal index
build which locks out other writers for at least O(n lg n) time. Anything on
the order of 100ms is probably as good as it gets here.

-- 
greg



Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Tue, 2005-12-06 at 19:32, Greg Stark wrote:
> Hannu Krosing <hannu@skype.net> writes:
> 
> > The scenario where a concurrent create index command is needed is 24/7
> > OLTP databases, which can't be taken down for maintenance. Usually they
> > can be arranged to tolerate postponing a few transactions for one
> > second.
> 
> Well, the dominant defining characteristic of "OLTP" is precisely that you do
> *not* have under your control the timing requirements and can't make such
> arrangements. That is, you have to process requests as fast as they come in
> whatever that might be.

While "as fast as possible" is a good goal when designing and optimising
a DB engine proper, you never need to design a real system to a spec "as
fast as possible" but rather to some given expected performance.

For me a 24/7 OLTP is more like a "Real Time" system, where all queries
have to be processed in "not more than" a certain time v.s. "as fast as
possible". There "as fast as possible" is a secondary goal, a lot less
important than meeting the deadlines.

For example one real db processes requests usually in 50-200ms, but the
maximum the client is prepared to wait is set to 20 sec. Anything longer
than that and the bells start ringing.

> But that said, realistically *any* solution has to obtain a lock at some time
> to make the schema change. I would say pretty much any O(1) (constant time)
> outage is at least somewhat acceptable as contrasted with the normal index
> build which locks out other writers for at least O(n lg n) time. Anything on
> the order of 100ms is probably as good as it gets here.

For me any delay less than the client timeout is acceptable and anything
more than that is not. N sec is ok, N+1 is not. It's as simple as that.

And if the CREATE INDEX takes 2 weeks in order to let other OLTP
processing go on uninterrupted then it is completely OK. I can afford to
set the deadline for it accordingly. 

Come to think of it, maybe concurrent CREATE INDEX should also honour
the vacuum_cost_* GUCs and throttle its progress accordingly in order
not to starve others of IO/CPU.

--------------------
Hannu




Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Jochem van Dieten
Date:
On 12/6/05, Hannu Krosing wrote:
>
> 1) run a transaction repeatedly, trying to hit a point of no concurrent
> transactions, enhance the odds by locking out other transactions from
> starting for a few (tenths or hundredths of) seconds, if it
> succeeds, record SNAP1, commit and continue, else rollback, then
> sleep a little and retry.

Which locks can be released by committing here?


> 2) build index on all rows inserted before SNAP1
>
> 3) run a transaction repeatedly, trying to hit a point of no concurrent
> transactions by locking out other transactions for a few (tenths or
> hundredths of) seconds, if it succeeds, record SNAP2, mark index as
> visible for inserts, commit. now all new transactions see the index and
> use it when inserting new tuples.
>
> 4) scan over table, add all tuples between SNAP1 and SNAP2 to index

You cannot guarantee that every tuple inserted in the table will be
visible to SNAP2 if you take SNAP2 before the transaction that committed
the insert-only index has dropped below the global XMIN horizon.


> 5) mark index as usable for query plans

How about:

- begin transaction X1
  - insert all visible tuples in an index
  - mark index incomplete
- commit

- wait for X1 to become visible to all running transactions (X1 is
known from the XMIN in pg_class / pg_index)

- begin transaction X2
  - insert all missing tuples in index
  - mark index complete
- commit
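
A rough Python sketch of the "wait for X1 to become visible to all
running transactions" step (purely illustrative; get_running_xmins() just
stands in for whatever backend state would really be consulted):

import itertools
import time

def wait_for_horizon(x1, get_running_xmins, poll=0.01):
    while True:
        xmins = get_running_xmins()
        # X1 is safely in the past once every running transaction started
        # after it committed, i.e. all xmins are greater than X1.
        if all(x1 < xmin for xmin in xmins):
            return
        time.sleep(poll)

# Simulated backends: the transactions that predate X1 finish after a poll.
snapshots = itertools.chain([[95, 103], [103]], itertools.repeat([]))
wait_for_horizon(x1=100, get_running_xmins=lambda: next(snapshots))
print("X1 is below the global xmin horizon; the second pass can run")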

Jochem

Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Greg Stark
Date:
Hannu Krosing <hannu@skype.net> writes:

> > But that said, realistically *any* solution has to obtain a lock at some time
> > to make the schema change. I would say pretty much any O(1) (constant time)
> > outage is at least somewhat acceptable as contrasted with the normal index
> > build which locks out other writers for at least O(n lg n) time. Anything on
> > the order of 100ms is probably as good as it gets here.
> 
> For me any delay less than the client timeout is acceptable and anything
> more than that is not. N sec is ok, N+1 is not. It's as simple as that.

I don't think the client timeout is directly relevant here. If your client
timeout is 20s and you take 19s, how many requests have queued up behind you?
If you normally process requests in under 200ms and receive 10 requests per
second (handling at least 2 simultaneously) then you now have 190 requests
queued up. Those requests take resources and will slow down your server. If
they slow things down too much then you will start failing to meet your 200ms
deadline.

It's more likely that your system is engineered to use queueing and
simultaneous dispatch to deal with spikes in load up to a certain margin. Say
you know it can deal with spikes in load of up to 2x the regular rate. Then
you can deal with service outage of up to the 200ms deadline. If you can deal
with spikes of up to 4x the regular rate then you can deal with an outage of
up to 600ms. 

Moreover even if you had the extra resources available to handle a 19s backlog
of requests, how long would it take you to clear that backlog? If you have a
narrow headroom on meeting the deadline in the first place, and now you have
even less headroom because of the resources dedicated to the queue, it'll take
you a long time to clear the backlog.
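
A back-of-the-envelope version of that argument, using the numbers from
the example above (200 ms service time, 10 requests/s, two handled at
once); the variable names are just for the illustration, and the
drain-time formula is simply backlog divided by spare capacity:

arrival_rate = 10                   # requests per second
service_time = 0.2                  # seconds per request
concurrency  = 2                    # requests handled simultaneously
outage       = 19                   # seconds spent waiting on the lock

capacity = concurrency / service_time            # 10 req/s: no headroom at all
backlog  = arrival_rate * outage                 # ~190 queued requests
spare    = capacity - arrival_rate
drain    = backlog / spare if spare > 0 else float("inf")
print(backlog, "queued;", drain, "seconds to drain with", spare, "req/s spare")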

We periodically ran into problems with load spikes or other performance
problems causing things to get very slow and stay slow for a while. Letting
things settle out usually worked but occasionally we had to restart the whole
system to clear out the queue of requests.

-- 
greg



Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Wed, 2005-12-07 at 13:36, Greg Stark wrote:
> Hannu Krosing <hannu@skype.net> writes:
> 
> > > But that said, realistically *any* solution has to obtain a lock at some time
> > > to make the schema change. I would say pretty much any O(1) (constant time)
> > > outage is at least somewhat acceptable as contrasted with the normal index
> > > build which locks out other writers for at least O(n lg n) time. Anything on
> > > the order of 100ms is probably as good as it gets here.
> > 
> > For me any delay less than the client timeout is acceptable and anything
> > more than that is not. N sec is ok, N+1 is not. It's as simple as that.
> 
> I don't think the client timeout is directly relevant here. 

It is relevant. It is the ultimate check of success or failure :)

> If your client
> timeout is 20s and you take 19s, how many requests have queued up behind you?
> If you normally process requests in under 200ms and receive 10 requests per
> second (handling at least 2 simultaneously) then you now have 190 requests
> queued up.

Again, I'm handling 20 to 200 simultaneously quite nicely.

> Those requests take resources and will slow down your server. If
> they slow things down too much then you will start failing to meet your 200ms
> deadline.

If I can't meet the deadline, I've got a problem. The rest is
implementation detail.

> It's more likely that your system is engineered to use queueing and
> simultaneous dispatch to deal with spikes in load up to a certain margin. Say
> you know it can deal with spikes in load of up to 2x the regular rate.

I know it can, just that the 3x spike lasts for 6 hours :P

> Then
> you can deal with service outage of up to the 200ms deadline. If you can deal
> with spikes of up to 4x the regular rate then you can deal with an outage of
> up to 600ms. 

Small local fluctuations happen all the time. As a rule of thumb I
want to stay below 50% of resource usage on average for any noticeable
period and will start looking for code optimisations or additional
hardware if this is crossed.

> Moreover even if you had the extra resources available to handle a 19s backlog
> of requests, how long would it take you to clear that backlog? If you have a
> narrow headroom on meeting the deadline in the first place, and now you have
> even less headroom because of the resources dedicated to the queue, it'll take
> you a long time to clear the backlog.

While it feels heroic to run at 90% capacity, it is not usually a good
policy. All kinds of unforeseen stuff happens all the time -
checkpoints, backups, vacuums, unexpected growth, system cronjobs, ...
With too little headroom you are screwed anyway.

What I am aiming at with this CONCURRENT CREATE INDEX proposal is to be
no more disruptive than other stuff that keeps happening anyway. That
would be the baseline. Anything better is definitely desirable, but
should not be a stopper for implementing the baseline functionality.

-----------------
Hannu









Re: Reducing relation locking overhead

From
"Jim C. Nasby"
Date:
On Fri, Dec 02, 2005 at 03:25:58PM -0500, Greg Stark wrote:
> Postgres would have no trouble building an index of the existing data using
> only shared locks. The problem is that any newly inserted (or updated) records
> could be missing from such an index.
> 
> To do it you would then have to gather up all those newly inserted records.
> And of course while you're doing that new records could be inserted. And so
> on. There's no guarantee it would ever finish, though I suppose you could
> detect the situation if the size of the new batch wasn't converging to 0 and
> throw an error.

Why throw an error? Just grab a lock that would prevent any new inserts
from occurring. Or at least make that an option.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Reducing relation locking overhead

From
"Jim C. Nasby"
Date:
On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > What's worse, once you have excluded writes you have to rescan the entire
> > table to be sure you haven't missed anything. So in the scenarios where this
> > whole thing is actually interesting, ie enormous tables, you're still
> > talking about a fairly long interval with writes locked out. Maybe not as
> > long as a complete REINDEX, but long.
> 
> I was thinking you would set a flag to disable use of the FSM for
> inserts/updates while the reindex was running. So you would know where to find
> the new tuples, at the end of the table after the last tuple you read.

What about keeping a separate list of new tuples? Obviously we'd only do
this when an index was being built on a table. Since it would probably
be problematic and expensive to check for this every time you accessed a
table, it would make sense to check only at the start of a transaction
and have an index build wait until all running transactions knew that an
index build was going to happen.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Reducing relation locking overhead

From
Hannu Krosing
Date:
On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
> On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > What's worse, once you have excluded writes you have to rescan the entire
> > > table to be sure you haven't missed anything. So in the scenarios where this
> > > whole thing is actually interesting, ie enormous tables, you're still
> > > talking about a fairly long interval with writes locked out. Maybe not as
> > > long as a complete REINDEX, but long.
> > 
> > I was thinking you would set a flag to disable use of the FSM for
> > inserts/updates while the reindex was running. So you would know where to find
> > the new tuples, at the end of the table after the last tuple you read.
> 
> What about keeping a separate list of new tuples? Obviously we'd only do
> this when an index was being built on a table. 

The problem with a separate list is that it can be huge. For example, on a
table with 200 inserts/updates per second, an index build lasting 6 hours
would accumulate a total of 6*3600*200 = 4,320,000 new tuples.

----------------
Hannu




Re: Reducing relation locking overhead

From
"Jim C. Nasby"
Date:
On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
> On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
> > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > > What's worse, once you have excluded writes you have to rescan the entire
> > > > table to be sure you haven't missed anything. So in the scenarios where this
> > > > whole thing is actually interesting, ie enormous tables, you're still
> > > > talking about a fairly long interval with writes locked out. Maybe not as
> > > > long as a complete REINDEX, but long.
> > > 
> > > I was thinking you would set a flag to disable use of the FSM for
> > > inserts/updates while the reindex was running. So you would know where to find
> > > the new tuples, at the end of the table after the last tuple you read.
> > 
> > What about keeping a separate list of new tuples? Obviously we'd only do
> > this when an index was being built on a table. 
> 
> The problem with separate list is that it can be huge. For example on a
> table with 200 inserts/updates per second an index build lasting 6 hours
> would accumulate total on 6*3600*200 = 4320000 new tuples.

Sure, but it's unlikely that such a table would be very wide, so 4.3M
tuples would probably only amount to a few hundred MB of data. It's also
possible that this list could be vacuumed by whatever the regular vacuum
process is for the table.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Thu, 2005-12-01 at 21:37 -0500, Tom Lane wrote:
> In looking at the current pgbench results, I notice that one
> considerable contribution to LWLock contention is access to the
> heavyweight-lock manager. 

> 4. The only reason we need to take relation-level locks on indexes
> at all is to make the world safe for REINDEX being done concurrently
> with read-only accesses to the table (that don't use the index being
> reindexed).  If we went back to requiring exclusive lock for reindex we
> could forget all about both #2 and #3.  Particularly for updates of
> relations with lots of indexes, this could be a pretty significant win.
> However we'd definitely be giving up something that was seen as a
> feature at one point, so I'm not sold on this idea ... unless someone
> can see a way to reduce the overhead without giving up concurrent
> REINDEX.

In a high txn rate workload the majority of lock requests will
definitely be towards indexes. This is a major overhead and contention
point and as you say, exists only to allow REINDEX (and online CREATE
TABLE when it exists) to operate.

There is a considerable price to be paid for this flexibility, so
although I think that flexibility is important, I do think we need to
look at ways of reducing the cost for the 95%+ of the time we pay the
price for no benefit. It seems clear that if we do not, then we are
going to have to completely redesign the lock manager, so any way
forward from here is a reasonable size task.

Putting extra restrictions on REINDEX-like operations would be
acceptable in these circumstances. 

Now follow various ideas, many of which are quite possibly bogus, but
the main point is that we need a way, even if it isn't one of these:

One such idea might be to require that they occur outside of a
transaction block, just like VACUUM.

Further thoughts:
1. Normally, we do not lock indexes via the LockMgrLock

2. When a REINDEX-like operation comes along, it first of all updates a
MaintIntentLock flag on the index relation, which causes a relcache
invalidation. It then waits until all backends have updated their
relcache.

3. While the MaintIntentLock is set, all operations acquire and release
LockMgrLocks on the index.

4. The REINDEX-like operation proceeds, having accepted a possibly
extensive delay in acquiring the MaintIntentLock, in the name of
concurrent access.

5. When the REINDEX completes, we unset the MaintIntentLock and other
backends return to normal operation.
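
A tiny Python sketch of that idea (the flag/relcache machinery is
hand-waved exactly as above, and LockTable and the other names are
invented for the illustration): index accesses bypass the shared lock
table unless the maintenance flag has been set.

class LockTable:
    def __init__(self):
        self.acquired = 0
    def lock(self, obj, mode):
        self.acquired += 1            # the LockMgrLock traffic we want to avoid

class Index:
    def __init__(self, name):
        self.name = name
        self.maint_intent = False     # set (via relcache inval) before a REINDEX

def access_index(idx, lock_table, mode="AccessShareLock"):
    if idx.maint_intent:              # step 3: only now pay for a heavyweight lock
        lock_table.lock(idx, mode)
    # ... use the index ...

lt, idx = LockTable(), Index("some_index")
for _ in range(1000):
    access_index(idx, lt)             # normal running: no lock-manager traffic
idx.maint_intent = True               # a REINDEX announced itself (step 2)
for _ in range(1000):
    access_index(idx, lt)
print(lt.acquired)                    # 1000, not 2000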

Best Regards, Simon Riggs



Re: Reducing relation locking overhead

From
Hannu Krosing
Date:
On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote:
> On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
> > On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
> > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > > > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > > > What's worse, once you have excluded writes you have to rescan the entire
> > > > > table to be sure you haven't missed anything. So in the scenarios where this
> > > > > whole thing is actually interesting, ie enormous tables, you're still
> > > > > talking about a fairly long interval with writes locked out. Maybe not as
> > > > > long as a complete REINDEX, but long.
> > > > 
> > > > I was thinking you would set a flag to disable use of the FSM for
> > > > inserts/updates while the reindex was running. So you would know where to find
> > > > the new tuples, at the end of the table after the last tuple you read.
> > > 
> > > What about keeping a separate list of new tuples? Obviously we'd only do
> > > this when an index was being built on a table. 
> > 
> > The problem with separate list is that it can be huge. For example on a
> > table with 200 inserts/updates per second an index build lasting 6 hours
> > would accumulate total on 6*3600*200 = 4320000 new tuples.
> 
> Sure, but it's unlikely that such a table would be very wide, so 4.3M
> tuples would probably only amount to a few hundred MB of data. It's also
> possible that this list could be vacuumed by whatever the regular vacuum
> process is for the table.

I think that keeping such a list as part of the table at a well-defined
location (like pages from N to M) is the best strategy, as it will
automatically make all new tuples available to parallel processes and
avoids both duplicate storage and the need for changing the
insert/update code.

---------------
Hannu



Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Csaba Nagy
Date:
On Wed, 2005-12-07 at 19:36, Greg Stark wrote:
[snip]
> We periodically ran into problems with load spikes or other performance
> problems causing things to get very slow and stay slow for a while. Letting
> things settle out usually worked but occasionally we had to restart the whole
> system to clear out the queue of requests.

Just as a personal opinion: I would love a REINDEX which does not block
reads/writes, even if writes will be more expensive while it's running.
There's always a period of time I can schedule the REINDEX so there's
very low write activity, but it is impossible to find a time slot when
there's NO write activity... and I think most of the real world
applications are like this. I think it's very rare that an application
is constantly getting high load, but most of them are constantly getting
SOME important activity which makes downtime hard to schedule. Now if
the slowdown of writes is not more than the acceptable service level,
then it is a very workable solution to schedule the REINDEX on a not so
busy time slot.

Cheers,
Csaba.




Re: Reducing relation locking overhead

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Further thoughts:
> 1. Normally, we do not lock indexes via the LockMgrLock

> 2. When a REINDEX-like operation comes along, it first of all updates an
> MaintIntentLock flag on the index relation, which causes a relcache
> invalidation. It then waits until all backends have updated their
> relcache.

There isn't any way for it to do that (ie, be sure everyone else has
adjusted to the new state of affairs), short of acquiring some sort of
short-term exclusive lock on the table, which is a really bad idea.
The pending lock would block other incoming requests on the table until
all the current users exited their transactions.

I think the idea of not doing locking on indexes was pretty much shot
down in this thread, and we have to go looking for other solutions ...
thus my other proposal.
        regards, tom lane


Re: Reducing relation locking overhead

From
Csaba Nagy
Date:
On Thu, 2005-12-08 at 16:05, Tom Lane wrote:
[SNIP]
> There isn't any way for it to do that (ie, be sure everyone else has
> adjusted to the new state of affairs), short of acquiring some sort of
> short-term exclusive lock on the table, which is a really bad idea.
> The pending lock would block other incoming requests on the table until
> all the current users exited their transactions.
> 

But it is an acceptable compromise to lock the table until all current
transactions are over... the alternative for reindexing a big table is
usually to schedule a down-time, which is even worse...

REINDEX is usually used to fix a big table's index bloat, and that
won't fly without a downtime, or, with this short-term full table lock,
in a low-traffic time-slot.

For my usage patterns I would vote for the table lock if it is just a
means of blocking new transactions until the running ones finish. I'll
just make sure there are no long-running ones when I issue the REINDEX...

Cheers,
Csaba.






Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Thu, 2005-12-08 at 16:23 +0100, Csaba Nagy wrote:
> On Thu, 2005-12-08 at 16:05, Tom Lane wrote:
> [SNIP]
> > There isn't any way for it to do that (ie, be sure everyone else has
> > adjusted to the new state of affairs), short of acquiring some sort of
> > short-term exclusive lock on the table, which is a really bad idea.
> > The pending lock would block other incoming requests on the table until
> > all the current users exited their transactions.

It might be possible to do this with some new form of utility lock that
can be unlocked before end of transaction, then re-locked again later.
That would imply the use of optimistic locking, as already discussed.
Only DDL and VACUUMs *need* to be locked out during the REINDEX.

I'm hand-waving here, so I'll stop. But we do know there *is* a way,
because this is already implemented elsewhere, somehow.

> But it is an acceptable compromise to lock the table until all current
> transactions are over... the alternative for reindexing a big table is
> usually to schedule a down-time, which is even worse...
> 
> REINDEX is usually used to fix a big tables big index bloat, and that
> won't fly without a downtime, or, with this short-term full table lock
> in a low-traffic time-slot. 
> 
> For my usage patterns I would vote with the table lock if it is just a
> means of blocking new transactions until the running ones finish. I'll
> just make sure there are none long running when I issue the REINDEX...

Certainly that is my view. You only schedule these things when you have
to, picking as light a traffic period as you can. Most people would accept
some pretty hefty restrictions in order to get it to work.

Best Regards, Simon Riggs



Improving free space usage (was: Reducing relation locking overhead)

From
"Jim C. Nasby"
Date:
On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote:
> On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote:
> > On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
> > > On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
> > > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > > > > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > > > > What's worse, once you have excluded writes you have to rescan the entire
> > > > > > table to be sure you haven't missed anything. So in the scenarios where this
> > > > > > whole thing is actually interesting, ie enormous tables, you're still
> > > > > > talking about a fairly long interval with writes locked out. Maybe not as
> > > > > > long as a complete REINDEX, but long.
> > > > > 
> > > > > I was thinking you would set a flag to disable use of the FSM for
> > > > > inserts/updates while the reindex was running. So you would know where to find
> > > > > the new tuples, at the end of the table after the last tuple you read.
> > > > 
> > > > What about keeping a separate list of new tuples? Obviously we'd only do
> > > > this when an index was being built on a table. 
> > > 
> > > The problem with separate list is that it can be huge. For example on a
> > > table with 200 inserts/updates per second an index build lasting 6 hours
> > > would accumulate total on 6*3600*200 = 4320000 new tuples.
> > 
> > Sure, but it's unlikely that such a table would be very wide, so 4.3M
> > tuples would probably only amount to a few hundred MB of data. It's also
> > possible that this list could be vacuumed by whatever the regular vacuum
> > process is for the table.
> 
> I think that keeping such list as part the table at well defined
> location (like pages from N to M) is the best strategy, as it will
> automatically make all new tuples available to parallel processes and
> avoids both duplicate storage as well as the the need for changing
> insert/update code.

There's one thing I hate about that idea though: good luck trying to
move those tuples somewhere else after the index build is done and you
now want to shrink the table back down to a more normal size. If we had
a better way to do that it would be much more palatable, but right now
on a heavily updated table this would result in a lot of bloat.

Along those lines, I've wondered if it makes sense to add more
flexibility in how free space is reclaimed in a table. One obvious
possibility is to have a strategy where new tuples will always look to
the FSM for space (instead of going into the current page if possible),
and the FSM will always hand out the earliest page in the table it has.
This mode would have the effect of moving tuples towards the front of
the table, allowing for space reclamation. A variation might be that
this mode will not affect tuples that are generated as part of an UPDATE
and are in the first x% of the table, since it doesn't make sense to
move a tuple from page 2 to page 1 in a 1000 page table.

Another possibility is to always go to the FSM and to have the FSM hand
back the page that is closest to the new tuple according to a certain
index. This would allow for ALTER TABLE CLUSTER to be much more
self-maintaining. The downside is that you'd have to do a lookup on that
index, but presumably if the index is important enough to cluster on
then it should be well-cached. There are probably some other tweaks that
could be done as well to make this more performant.
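
A toy Python sketch of the first strategy (all of it made up for the sake
of the example): an FSM keyed by page number that always hands out the
lowest page with enough free space, except for UPDATE-generated tuples
already sitting in the first x% of the table.

import heapq

class ToyFSM:
    def __init__(self, free_by_page, table_pages, skip_fraction=0.10):
        self.heap = list(free_by_page.items())        # (page, free bytes)
        heapq.heapify(self.heap)                      # earliest page first
        self.cutoff = int(table_pages * skip_fraction)

    def page_for(self, size, from_update_on_page=None):
        # don't bother relocating a tuple that is already near the front
        if from_update_on_page is not None and from_update_on_page <= self.cutoff:
            return from_update_on_page
        passed_over = []
        try:
            while self.heap:
                page, free = heapq.heappop(self.heap)
                if free >= size:
                    heapq.heappush(self.heap, (page, free - size))
                    return page
                passed_over.append((page, free))
            return None                               # caller extends the relation
        finally:
            for item in passed_over:
                heapq.heappush(self.heap, item)

fsm = ToyFSM({2: 40, 57: 400, 998: 400}, table_pages=1000)
print(fsm.page_for(100))                              # 57: earliest page that fits
print(fsm.page_for(100, from_update_on_page=1))       # 1: already in the first 10%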
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Reducing relation locking overhead

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I'm hand-waving here, so I'll stop. But we do know there *is* a way,
> because this is already implemented elsewhere, somehow.

That's not really the point --- the question is whether the cure is
worse than the disease.  It's entirely possible that the tradeoffs
needed to support fully concurrent REINDEX would represent a higher
price than the feature is worth, or that it's simply too much work
to get there from here.

For instance, I would imagine that the way Oracle does this relies
on their use of rollback segments, which is something we're certainly
unlikely to emulate.

Given the discussion so far, it seems likely to me that completely
concurrent REINDEX is indeed out of reach, and that what we ought to
be thinking about is what sort of compromise design (ie, partially
concurrent REINDEX) is reasonable.

Something that might work is:

1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes),
then run existing CREATE INDEX code.  The finished index may be missing
some tuples inserted during the run.

2. Commit transaction so that index becomes visible (we assume it's
marked so that the planner will know not to rely on it).  Continue to
hold ShareUpdateExclusiveLock so VACUUM doesn't run.

3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep
loop instead of just flat-out LockAcquire).  Once we have this we know
there are no active writer transactions.  Release the lock immediately.

4. Make a new scan of the table and insert any rows not already present
in the index.  (This need not process anything inserted later than step
3, because any new transactions will know to insert rows in the index
anyway.)

5. Mark index good and commit, releasing all locks.
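
A compressed Python sketch of these five steps, with the lock modes
spelled out as strings and everything else (catalogs, scans) reduced to
comments; it only shows the ordering of the lock operations, and none of
the calls correspond to real backend functions:

def concurrent_create_index(acquire, release, commit):
    acquire("ShareUpdateExclusiveLock")   # 1. blocks VACUUM and DDL, not DML
    #    ... run the existing CREATE INDEX code; result may miss recent rows
    commit()                              # 2. index visible but marked "don't
    #    plan with me"; ShareUpdateExclusiveLock is kept across the commit
    acquire("ShareLock")                  # 3. wait out all active writers ...
    release("ShareLock")                  #    ... then let them back in at once
    #    4. rescan the table, insert any rows the first pass missed
    #    5. mark the index good
    release("ShareUpdateExclusiveLock")
    commit()

log = []
concurrent_create_index(acquire=lambda m: log.append(("acquire", m)),
                        release=lambda m: log.append(("release", m)),
                        commit=lambda: log.append(("commit",)))
print(*log, sep="\n")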

I don't think that it's worth the effort and complexity to try to avoid
a full table scan in step 4.  At best you would save much less than 50%
of the total work, and the impact on normal operations is not free.

If what you want is a REINDEX rather than creating an independent new
index, then at step 5 you need to do a swap operation which'll require
obtaining exclusive lock on the index.  This creates another opportunity
for deadlock failures, but again a conditional loop might help.

There are still some issues about the behavior when the index is UNIQUE.
Ideally you would like the CREATE INDEX to fail on a duplicate, not any
concurrent writer transaction, but I don't think it's possible to
guarantee that.

Also, I'm not sure how we get rid of the broken index if there is a
failure later than step 2.
        regards, tom lane


Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)

From
Ron Mayer
Date:
Simon Riggs wrote:
> ...REINDEX...CREATE/DROP INDEX...

I'm curious if partitioning can help
provide online create/reindex.

For example, if I could set up a table
partitioned by "modified time"; could I make
a new partition so all new inserts go into a
new partition and then I can re-index / create-index
on the old partitions concurrently because only
reads and deletes would be happening on that
old table?

The new partition would have a new index anyway
so no reindex would even be needed on that part.


Re: Improving free space usage (was: Reducing relation locking overhead)

From
Hannu Krosing
Date:
On Thu, 2005-12-08 at 12:57, Jim C. Nasby wrote:
> On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote:

> > > > The problem with separate list is that it can be huge. For example on a
> > > > table with 200 inserts/updates per second an index build lasting 6 hours
> > > > would accumulate total on 6*3600*200 = 4320000 new tuples.
> > > 
> > > Sure, but it's unlikely that such a table would be very wide, so 4.3M
> > > tuples would probably only amount to a few hundred MB of data. It's also
> > > possible that this list could be vacuumed by whatever the regular vacuum
> > > process is for the table.
> > 
> > I think that keeping such list as part the table at well defined
> > location (like pages from N to M) is the best strategy, as it will
> > automatically make all new tuples available to parallel processes and
> > avoids both duplicate storage as well as the the need for changing
> > insert/update code.
> 
> There's one thing I hate about that idea though: good luck trying to
> move those tuples somewhere else after the index build is done and you
> now want to shrink the table back down to a more normal size. 

I feel your pain.

To solve a similar problem I have been forced to write scripts that do
updates of pk_column=pk_column until the tuple moves to another page, as
shown by its ctid. Not a sensible thing to do (doing a lot of updates to
*increase* performance), but necessary nevertheless considering current
postgres behaviour.

> If we had
> a better way to do that it would be much more palatable, but right now
> on a heavily updated table this would result in a lot of bloat.

Actually any long transaction would do that.

> Along those lines, I've wondered if it makes sense to add more
> flexibility in how free space is reclaimed in a table. One obvious
> possibility is to have a strategy where new tuples will always look to
> the FSM for space (instead of going into the current page if possible),
> and the FSM will always hand out the earliest page in the table it has.
> This mode would have the effect of moving tuples towards the front of
> the table, allowing for space reclamation. A variation might be that
> this mode will not effect tuples that are generated as part of an UPDATE
> and are in the first x% of the table, since it doesn't make sense to
> move a tuple from page 2 to page 1 in a 1000 page table.

This % could depend on some "fill factor" of the table, the aim being not
to move tuples that would end up within the final volume of a balanced
table, which, in the case of a heavily updated table, would probably be 2
to 3 times the volume of a densely populated table.

> Another possibility is to always go to the FSM and to have the FSM hand
> back the page that is closest to the new tuple according to a certain
> index. This would allow for ALTER TABLE CLUSTER to be much more
> self-maintaining. The downside is that you'd have to do a lookup on that
> index, but presumably if the index is important enough to cluster on
> then it should be well-cached. There's probably some other tweaks that
> could be done as well to make this more performant.

Yes, I agree on all your points about better placement of new tuples;
they would all be useful indeed.

--------------
Hannu






Re: Improving free space usage (was: Reducing relation locking overhead)

From
"Jim C. Nasby"
Date:
On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote:
> > Along those lines, I've wondered if it makes sense to add more
> > flexibility in how free space is reclaimed in a table. One obvious
> > possibility is to have a strategy where new tuples will always look to
> > the FSM for space (instead of going into the current page if possible),
> > and the FSM will always hand out the earliest page in the table it has.
> > This mode would have the effect of moving tuples towards the front of
> > the table, allowing for space reclamation. A variation might be that
> > this mode will not effect tuples that are generated as part of an UPDATE
> > and are in the first x% of the table, since it doesn't make sense to
> > move a tuple from page 2 to page 1 in a 1000 page table.
> 
> This % could be depending on some "fill factor" of the table, aiming not
> to move tuples, that would end up in the final volume of a balance
> table, which, in case of heavily updated table, would probably be 2 to 3
> times the volume of densely populated table.
> 
> > Another possibility is to always go to the FSM and to have the FSM hand
> > back the page that is closest to the new tuple according to a certain
> > index. This would allow for ALTER TABLE CLUSTER to be much more
> > self-maintaining. The downside is that you'd have to do a lookup on that
> > index, but presumably if the index is important enough to cluster on
> > then it should be well-cached. There's probably some other tweaks that
> > could be done as well to make this more performant.
> 
> Yes, I agree on all your points about better placement of new tuples,
> all they would be useful indeed.

Sounds like a TODO, barring objections...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Reducing relation locking overhead

From
Hannu Krosing
Date:
On Thu, 2005-12-08 at 14:53, Tom Lane wrote:
> Given the discussion so far, it seems likely to me that completely
> concurrent REINDEX is indeed out of reach, and that what we ought to
> be thinking about is what sort of compromise design (ie, partially
> concurrent REINDEX) is reasonable.
> 
> Something that might work is:
> 
> 1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes),
> then run existing CREATE INDEX code.  The finished index may be missing
> some tuples inserted during the run.
> 
> 2. Commit transaction so that index becomes visible (we assume it's
> marked so that the planner will know not to rely on it).  Continue to
> hold ShareUpdateExclusiveLock so VACUUM doesn't run.
> 
> 3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep
> loop instead of just flat-out LockAcquire).  Once we have this we know
> there are no active writer transactions.  Release the lock immediately.
> 
> 4. Make a new scan of the table and insert any rows not already present
> in the index.  (This need not process anything inserted later than step
> 3, because any new transactions will know to insert rows in the index
> anyway.)

How do you plan to determine "any rows not already present in the index"
without explicitly remembering the start and end snapshots of the existing
CREATE INDEX (SNAP1 and SNAP2 in my proposal)? Actually the end point
seems to be covered, but what about the start condition?

In the last round of discussion you pointed out that the index itself can't
be effectively used for this in case there are lots of equal index keys.
(As I pointed out, this can be fixed if we start using ctid to
determine the placement/order of equal keys, but I don't think we are
building indexes this way now.)

I still think that wedging the start of step 1 and the end of step 2 into
points where no concurrent transaction is running would be the easiest
and most robust way to do it.

And if the attempts (locking periods) to find/force that spot are short
enough, they can be tolerated in practice.

> 5. Mark index good and commit, releasing all locks.
> 
> I don't think that it's worth the effort and complexity to try to avoid
> a full table scan in step 4.  At best you would save much less than 50%
> of the total work, and the impact on normal operations is not free.

Agreed. The use case needing a concurrent index build, being already slow,
can probably be made to tolerate another 2-3x slowdown.

> If what you want is a REINDEX rather than creating an independent new
> index, then at step 5 you need to do a swap operation which'll require
> obtaining exclusive lock on the index.  This creates another opportunity
> for deadlock failures, but again a conditional loop might help.
> 
> There are still some issues about the behavior when the index is UNIQUE.
> Ideally you would like the CREATE INDEX to fail on a duplicate, not any
> concurrent writer transaction, but I don't think it's possible to
> guarantee that.

Ideally, but probably not too important in practice. The point can always
be made that there already is a unique index at the point where the
concurrent transaction fails. If the point is before the end of step 2,
the concurrent transaction will probably wait until the first commit
before failing, no?

> Also, I'm not sure how we get rid of the broken index if there is a
> failure later than step 2.

What about an explicit DROP INDEX? Even for REINDEX the index has to be
visible as a separate index after step 2, so that inserts/updates will be
aware of it.

--------------
Hannu





Re: Reducing relation locking overhead

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> How do you plan to determine "any rows not already present in the index"
> without explicitly remembering the start and end snapshots of existing
> CREATE INDEX (SNAP1 and SNAP2 in my proposal)?

I was thinking in terms of actually looking into the index to see if the
particular TID is present or not.  You could use snapshots to optimize
this by avoiding index probes for tuples that must be present, which
hopefully will be most of 'em.  Also you need a snapshot to detect
tuples that are new enough that they certainly will be indexed by their
inserting transaction, so that you don't have a race condition between
an active inserter and the REINDEX.  (I think this is possible but maybe
I missed something.)  That leaves you looking at just the tuples
inserted by transactions that might or might not have known about the
index.  So yeah, you do need SNAP1 and SNAP2 but they're being used in
a different way than the original proposal.
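
A small Python sketch of that classification (xid comparisons simplified
to plain integers, and the function name invented for the illustration):
only tuples from the in-between window need an index probe.

def needs_index_probe(tuple_xmin, snap1_xid, snap2_xid):
    if tuple_xmin < snap1_xid:
        # visible to SNAP1, so the first CREATE INDEX pass indexed it
        return False, "must already be present"
    if tuple_xmin >= snap2_xid:
        # new enough that the inserting transaction saw and maintained the index
        return False, "indexed by its inserting transaction"
    return True, "in-doubt window: probe the index for this TID"

for xmin in (80, 105, 140):
    print(xmin, needs_index_probe(xmin, snap1_xid=100, snap2_xid=120))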

> In the last round of discussion you pointed out that index itself can't
> be effectively used for this in case there are lots of equal index keys.

True, but if you can avoid going to the index at all for the majority of
the tuples, I think this is tolerable.  In any case the design idea here
seems to be "we don't care how long REINDEX takes as long as it's not
blocking anyone".
        regards, tom lane


Re: Reducing relation locking overhead

From
Simon Riggs
Date:
On Sat, 2005-12-10 at 21:07 -0500, Tom Lane wrote:
> In any case the design idea here
> seems to be "we don't care how long REINDEX takes as long as it's not
> blocking anyone".

All sounds great so far. I'd like this as an option for CREATE INDEX
also.

Best Regards, Simon Riggs





Re: Reducing relation locking overhead

From
Hannu Krosing
Date:
On Sat, 2005-12-10 at 21:07, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > How do you plan to determine "any rows not already present in the index"
> > without explicitly remembering the start and end snapshots of existing
> > CREATE INDEX (SNAP1 and SNAP2 in my proposal)?
> 
> I was thinking in terms of actually looking into the index to see if the
> particular TID is present or not.  You could use snapshots to optimize
> this by avoiding index probes for tuples that must be present, which
> hopefully will be most of 'em.  Also you need a snapshot to detect
> tuples that are new enough that they certainly will be indexed by their
> inserting transaction, so that you don't have a race condition between
> an active inserter and the REINDEX.  (I think this is possible but maybe
> I missed something.)  

So would it be possible to do the whole indexing while another long
transaction runs, possibly at the SERIALIZABLE isolation level, inserting
tuples all the way?

> That leaves you looking at just the tuples
> inserted by transactions that might or might not have known about the
> index.  So yeah, you do need SNAP1 and SNAP2 but they're being used in
> a different way than the original proposal.

By taking the snaps at points with no concurrent transactions I was hoping
to make things (both concept and code) simpler by reducing SNAP1 and
SNAP2 to mere transaction ids, and by making sure that the tuples needing
to be added to the index are exactly those inserted by transactions
falling between them.
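
A sketch of the membership test that simplification would buy, assuming
SNAP1 and SNAP2 really can be reduced to two boundary xids taken at
quiescent points (the function and argument names are made up, this is not
backend code):

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch only: does the tuple's inserting transaction fall into the
 * window between the two boundary xids, i.e. does the second pass have
 * to add it to the index?
 */
static bool
inserted_between_snaps(TransactionId xmin,
                       TransactionId snap1_xid,
                       TransactionId snap2_xid)
{
    if (TransactionIdPrecedesOrEquals(xmin, snap1_xid))
        return false;   /* covered by the initial build scan */
    if (TransactionIdFollows(xmin, snap2_xid))
        return false;   /* inserter already knew about the index */
    return true;        /* in the window: handle it in the second pass */
}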

If it does not actually make things clearer then it may not be a good
idea to try to fit them between transactions. 

OTOH, we still do need to wait - possibly a long time - for "3. Attempt
to acquire ShareLock". So why not also wait at the start, in case that
would make things clearer/simpler?

> > In the last round of discussion you pointed out that the index itself can't
> > be effectively used for this in case there are lots of equal index keys.
> 
> True, but if you can avoid going to the index at all for the majority of
> the tuples, I think this is tolerable.

And it seems that we could store same-valued index leaf entries in TID
order, at least during a concurrent create index/reindex, making the index
lookup cheaper.

To me, it seems a good idea to always store equal values in an index in
TID order, as this makes an index scan do the right thing (scan in heap
order), at least for the key=const case, even without an intermediate
bitmap index scan. Or are there downsides to this in the general case?
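
As a sketch of what I mean by keeping equal keys in TID order
(compare_keys() is only a stand-in for the datatype-specific comparison;
this is not actual btree code):

#include "postgres.h"
#include "storage/itemptr.h"

/* hypothetical stand-in for the datatype-specific key comparison */
extern int compare_keys(Datum a, Datum b);

/*
 * Order entries by key first; for equal keys fall back to heap TID
 * (block number, then offset), so an index scan over one key value
 * visits the heap in physical order.
 */
static int
compare_index_entries(Datum key_a, ItemPointer tid_a,
                      Datum key_b, ItemPointer tid_b)
{
    int cmp = compare_keys(key_a, key_b);

    if (cmp != 0)
        return cmp;

    return ItemPointerCompare(tid_a, tid_b);
}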

>  In any case the design idea here
> seems to be "we don't care how long REINDEX takes as long as it's not
> blocking anyone".

Yes, that's the general idea.

Within reason, of course - for example, making a seqscan over the index for
each and every tuple inserted while the index is first being built would
probably still be too slow :)

>             regards, tom lane



Re: Reducing relation locking overhead

From
Alvaro Herrera
Date:
Hannu Krosing wrote:
> On Sat, 2005-12-10 at 21:07, Tom Lane wrote:

> >  In any case the design idea here
> > seems to be "we don't care how long REINDEX takes as long as it's not
> > blocking anyone".
> 
> Yes, that's the general idea.
> 
> Within reason, of course - for example, making a seqscan over the index for
> each and every tuple inserted while the index is first being built would
> probably still be too slow :)

You don't need to seqscan the _index_.  You need to scan the table.
For the tuples that do not satisfy the snapshot, or where you are in doubt,
you examine the index to see whether they are there.  The bulk of them you
just skip.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Improving free space usage (was: Reducing relation locking overhead)

From
Bruce Momjian
Date:
Added to TODO:

    * Allow FSM to return free space toward the beginning of the heap file, in
      hopes that empty pages at the end can be truncated by VACUUM
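
For illustration, one way to read that TODO (sketch only -- FsmPageEntry
and the array representation are made up for the example, not the actual
FSM code):

#include <stddef.h>

typedef unsigned int BlockNumber;               /* stands in for the real type */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* made-up representation of what the FSM knows about one heap page */
typedef struct FsmPageEntry
{
    BlockNumber block;          /* heap page number */
    size_t      freeBytes;      /* free space recorded for that page */
} FsmPageEntry;

/*
 * Prefer the lowest-numbered page with enough free space, so new tuples
 * migrate toward the front of the heap and VACUUM can eventually truncate
 * empty pages off the end.  Returns InvalidBlockNumber if no tracked page
 * has room (the caller would extend the relation instead).
 */
static BlockNumber
fsm_earliest_page_with_space(const FsmPageEntry *entries, int nentries,
                             size_t needed)
{
    BlockNumber best = InvalidBlockNumber;
    int         i;

    for (i = 0; i < nentries; i++)
    {
        if (entries[i].freeBytes >= needed &&
            (best == InvalidBlockNumber || entries[i].block < best))
            best = entries[i].block;
    }
    return best;
}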
 


---------------------------------------------------------------------------

Jim C. Nasby wrote:
> On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote:
> > > Along those lines, I've wondered if it makes sense to add more
> > > flexibility in how free space is reclaimed in a table. One obvious
> > > possibility is to have a strategy where new tuples will always look to
> > > the FSM for space (instead of going into the current page if possible),
> > > and the FSM will always hand out the earliest page in the table it has.
> > > This mode would have the effect of moving tuples towards the front of
> > > the table, allowing for space reclamation. A variation might be that
> > > this mode will not affect tuples that are generated as part of an UPDATE
> > > and are in the first x% of the table, since it doesn't make sense to
> > > move a tuple from page 2 to page 1 in a 1000 page table.
> > 
> > This % could depend on some "fill factor" of the table, aiming not to
> > move tuples that would end up within the final volume of a balanced
> > table, which, in the case of a heavily updated table, would probably be
> > 2 to 3 times the volume of a densely populated table.
> > 
> > > Another possibility is to always go to the FSM and to have the FSM hand
> > > back the page that is closest to the new tuple according to a certain
> > > index. This would allow for ALTER TABLE CLUSTER to be much more
> > > self-maintaining. The downside is that you'd have to do a lookup on that
> > > index, but presumably if the index is important enough to cluster on
> > > then it should be well-cached. There's probably some other tweaks that
> > > could be done as well to make this more performant.
> > 
> > Yes, I agree on all your points about better placement of new tuples;
> > they would all be useful indeed.
> 
> Sounds like a TODO, barring objections...
> -- 
> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> Pervasive Software      http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> 

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Improving free space usage (was: Reducing relation locking overhead)

From
"Jim C. Nasby"
Date:
At the risk of editorializing...

* Allow for tables with more than a certain percent of free space to
  force new tuple allocation in the last few pages to come only from the
  FSM (preferring the earliest pages), with the intention of allowing the
  table to shrink

I believe there's also a second TODO:

* Allow FSM page hand-out on tables to be influenced by the clustered
  index. This would help maintain cluster order.
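
A sketch of what that could mean in practice (the structures are made up
for the example; the target block is assumed to come from a lookup in the
clustering index for the new tuple's key, and how that lookup happens is
exactly the open question):

#include <stddef.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* made-up shape for what the FSM tracks per heap page */
typedef struct FsmPageEntry
{
    BlockNumber block;
    size_t      freeBytes;
} FsmPageEntry;

/*
 * Given a target heap page (assumed to have been derived from a lookup in
 * the clustering index for the new tuple's key), hand back the tracked
 * page with enough free space that lies closest to that target, so new
 * tuples land near their cluster-order neighbours.
 */
static BlockNumber
fsm_page_nearest_target(const FsmPageEntry *entries, int nentries,
                        size_t needed, BlockNumber target)
{
    BlockNumber best = InvalidBlockNumber;
    BlockNumber bestDist = 0;
    int         i;

    for (i = 0; i < nentries; i++)
    {
        BlockNumber dist;

        if (entries[i].freeBytes < needed)
            continue;
        dist = (entries[i].block > target) ? entries[i].block - target
                                           : target - entries[i].block;
        if (best == InvalidBlockNumber || dist < bestDist)
        {
            best = entries[i].block;
            bestDist = dist;
        }
    }
    return best;        /* InvalidBlockNumber: nothing suitable tracked */
}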

On Thu, Mar 02, 2006 at 10:12:59PM -0500, Bruce Momjian wrote:
> 
> Added to TODO:
> 
>     * Allow FSM to return free space toward the beginning of the heap file, in
>       hopes that empty pages at the end can be truncated by VACUUM
> 
> 
> ---------------------------------------------------------------------------
> 
> Jim C. Nasby wrote:
> > On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote:
> > > > Along those lines, I've wondered if it makes sense to add more
> > > > flexibility in how free space is reclaimed in a table. One obvious
> > > > possibility is to have a strategy where new tuples will always look to
> > > > the FSM for space (instead of going into the current page if possible),
> > > > and the FSM will always hand out the earliest page in the table it has.
> > > > This mode would have the effect of moving tuples towards the front of
> > > > the table, allowing for space reclamation. A variation might be that
> > > > this mode will not affect tuples that are generated as part of an UPDATE
> > > > and are in the first x% of the table, since it doesn't make sense to
> > > > move a tuple from page 2 to page 1 in a 1000 page table.
> > > 
> > > This % could depend on some "fill factor" of the table, aiming not to
> > > move tuples that would end up within the final volume of a balanced
> > > table, which, in the case of a heavily updated table, would probably be
> > > 2 to 3 times the volume of a densely populated table.
> > > 
> > > > Another possibility is to always go to the FSM and to have the FSM hand
> > > > back the page that is closest to the new tuple according to a certain
> > > > index. This would allow for ALTER TABLE CLUSTER to be much more
> > > > self-maintaining. The downside is that you'd have to do a lookup on that
> > > > index, but presumably if the index is important enough to cluster on
> > > > then it should be well-cached. There's probably some other tweaks that
> > > > could be done as well to make this more performant.
> > > 
> > > Yes, I agree on all your points about better placement of new tuples;
> > > they would all be useful indeed.
> > 
> > Sounds like a TODO, barring objections...
> > -- 
> > Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> > Pervasive Software      http://pervasive.com    work: 512-231-6117
> > vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> > 
> 
> -- 
>   Bruce Momjian   http://candle.pha.pa.us
>   SRA OSS, Inc.   http://www.sraoss.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving free space usage (was: Reducing relation locking overhead)

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> At the risk of editorializing...
> 
> * Allow for tables with more than a certain percent of free space to
>   force new tuple allocation in the last few pages to come only from the
>   FSM (preferring the earliest pages), with the intention of allowing the
>   table to shrink

That seems too confusing.

> I also believe there's also a second TODO:
> 
> * Allow FSM page hand-out on tables to be influenced by the clustered
>   index. This would help maintain cluster order.

How would FSM know how to do that?  I added it with a question mark.

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +