Thread: CLUSTER not lose indexes

CLUSTER not lose indexes

From
Alvaro Herrera
Date:
Hackers:

I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table.  I attach the patch.

There are (of course) things I don't understand.  For example, whether
(or when) I should use CommandCounterIncrement() after each
index_create, or if I should call setRelhasindex() only once (and not
once per index); or whether I need to acquire some lock on the indexes.

I tested it with one table and several indexes.  Truth is I don't know
how to test for concurrency, or if it's worth the trouble.

The purpose of this experiment (and, I hope, more to follow) is to
familiarize myself with the guts of PostgreSQL, so I can work on my CS
thesis with it.  If you can point out my misconceptions I'd be happy to
try again (and again, and...)

Thank you.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"La Primavera ha venido. Nadie sabe como ha sido" (A. Machado)

Attachment

Re: CLUSTER not lose indexes

From
Tom Lane
Date:
Alvaro Herrera <alvherre@atentus.com> writes:
> There are (of course) things I don't understand.  For example, whether
> (or when) I should use CommandCounterIncrement() after each
> index_create, or if I should call setRelhasindex() only once (and not
> once per index); or whether I need to acquire some lock on the indexes.

I think you probably want a CommandCounterIncrement at the bottom of the
loop (after setRelhasindex).  If it works as-is, it's only by chance,
i.e. due to internal CCI calls in index_create().
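
For concreteness, a minimal sketch of the loop shape being suggested.  The
helper rebuild_one_index() and the list indexList are hypothetical stand-ins
for whatever the patch actually uses (index_create()'s long argument list is
hidden inside the helper), and setRelhasindex()'s exact signature is assumed
for this era:

    /*
     * Sketch only: recreate each index on the clustered heap, then
     * advance the command counter so the next iteration (and later
     * code) sees the catalog rows we just wrote.
     */
    List       *elem;

    foreach(elem, indexList)                    /* hypothetical saved-index list */
    {
        rebuild_one_index(OIDNewHeap, lfirst(elem));    /* wraps index_create() */
        setRelhasindex(OIDNewHeap, true);       /* assumed signature */
        CommandCounterIncrement();              /* bottom of the loop, as above */
    }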

Locking newly-created indexes is not really necessary, since no one else
can see them until you commit anyhow.

+         tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+                 0, 0, 0);
+         if (!HeapTupleIsValid(tuple))
+                 break;

Breaking out of the loop hardly seems an appropriate response to this
failure condition.  Not finding the index's pg_class entry is definitely
an error.
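
That is, something along these lines (a sketch; the error text is only
illustrative):

    tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
                           0, 0, 0);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "CLUSTER: cannot find pg_class entry for index %u",
             attrs->indexOID);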

I'd also suggest more-liberal commenting, as well as more attention to
updating the existing comments to match new reality.


In general, I'm not thrilled about expending more code on the existing
fundamentally-broken implementation of CLUSTER.  We need to look at
making use of the ability to write a new version of a table (or index)
under a new relfilenode value, without changing the table's OID.
However, some parts of your patch will probably still be needed when
someone gets around to making that happen, so I won't object for now.

            regards, tom lane



Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Hackers:
>
> I've modified commands/cluster.c so that it recreates the indexes on the
> table after clustering the table.  I attach the patch.
>
> There are (of course) things I don't understand.  For example, whether
> (or when) I should use CommandCounterIncrement() after each
> index_create, or if I should call setRelhasindex() only once (and not
> once per index); or whether I need to acquire some lock on the indexes.
>
> I tested it with one table and several indexes.  Truth is I don't know
> how to test for concurrency, or if it's worth the trouble.
>
> The purpose of this experiment (and, I hope, more to follow) is to
> familiarize myself with the guts of PostgreSQL, so I can work on my CS
> thesis with it.  If you can point out my misconceptions I'd be happy to
> try again (and again, and...)

I think Tom was suggesting that you may want to continue work on CLUSTER
and make use of relfilenode.  After the cluster, you can just update
pg_class.relfilenode with the new file name (random oid generated at
build time) and as soon as you commit, all backends will start using the
new file and you can delete the old one.

The particular case we would like to improve is this:

    /* Destroy old heap (along with its index) and rename new. */
    heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);

    CommandCounterIncrement();

    renamerel(OIDNewHeap, oldrelation->relname);

In this code, we delete the old relation, then rename the new one.  It
would be good to have this all happen in one update of
pg_class.relfilenode;  that way it is an atomic operation.

So, create a heap (in the temp namespace so it is deleted on a crash)
and copy the old heap into it in cluster order.  When you are done,
point the old pg_class row's relfilenode at the new clustered heap
file, point the new cluster heap's pg_class row at the old heap file,
and then drop the cluster heap; that will remove the _old_ file
(I believe at commit) and you are ready to go.

So, you are basically creating a new heap, but when you finish, the new
heap's pg_class entry and the old heap's file go away.  I thought about doing it
without creating the pg_class entry for the new heap, but the code
really wants to have a heap it can manipulate.

Same with index rebuilding, I think.  The indexes are built on the new
heap, and the relfilenode swap works just the same.
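
A rough sketch of the swap being described, written against roughly the
backend catalog APIs of this era.  The helper name swap_relfilenodes is
hypothetical (nothing like it exists yet), error checks are omitted, and the
pg_class index maintenance is only noted in a comment:

    #include "postgres.h"
    #include "access/heapam.h"
    #include "catalog/catname.h"
    #include "catalog/pg_class.h"
    #include "storage/lmgr.h"
    #include "utils/syscache.h"

    /* Hypothetical helper: exchange the physical files of two relations. */
    static void
    swap_relfilenodes(Oid r1, Oid r2)
    {
        Relation    relRelation;
        HeapTuple   tup1,
                    tup2;
        Oid         tmp;

        relRelation = heap_openr(RelationRelationName, RowExclusiveLock);

        /* Fetch modifiable copies of both pg_class rows. */
        tup1 = SearchSysCacheCopy(RELOID, ObjectIdGetDatum(r1), 0, 0, 0);
        tup2 = SearchSysCacheCopy(RELOID, ObjectIdGetDatum(r2), 0, 0, 0);

        /* Swap the relfilenode fields. */
        tmp = ((Form_pg_class) GETSTRUCT(tup1))->relfilenode;
        ((Form_pg_class) GETSTRUCT(tup1))->relfilenode =
            ((Form_pg_class) GETSTRUCT(tup2))->relfilenode;
        ((Form_pg_class) GETSTRUCT(tup2))->relfilenode = tmp;

        /* Write both rows back; nobody else sees this until we commit. */
        simple_heap_update(relRelation, &tup1->t_self, tup1);
        simple_heap_update(relRelation, &tup2->t_self, tup2);

        /* The pg_class indexes would have to be updated here as well
         * (CatalogOpenIndices/CatalogIndexInsert in this era). */

        heap_freetuple(tup1);
        heap_freetuple(tup2);
        heap_close(relRelation, RowExclusiveLock);
    }

Dropping the temp heap afterward then removes, at commit, the file it now
points at: the old, unclustered data.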

Let us know if you want to pursue that and we can give you additional
assistance.  I would like to see CLUSTER really polished up.  More
people are using it now and it really needs someone to focus on it.

Glad to see you working on CLUSTER.  Welcome aboard.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> Alvaro Herrera wrote:
> > Hackers:
> >
> > I've modified commands/cluster.c so that it recreates the indexes on the
> > table after clustering the table.  I attach the patch.

> I think Tom was suggesting that you may want to continue work on CLUSTER
> and make use of relfilenode.  After the cluster, you can just update
> pg_class.relfilenode with the new file name (random oid generated at
> build time) and as soon as you commit, all backends will start using the
> new file and you can delete the old one.

I'm looking at pg_class, and relfilenode is equal to oid in all cases
AFAICS.  If I create a new, "random" relfilenode, the equality will not
hold.  Is that OK?  Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?


> In this code, we delete the old relation, then rename the new one.  It
> would be good to have this all happen in one update of
> pg_class.relfilenode;  that way it is an atomic operation.

OK, I'll see if I can do that.

> Let us know if you want to pursue that and we can give you additional
> assistance.  I would like to see CLUSTER really polished up.  More
> people are using it now and it really needs someone to focus on it.

I thought CLUSTER was meant to be replaced by automatically clustering
indexes.  Isn't that so?  Is anyone working on that?


> Glad to see you working on CLUSTER.  Welcome aboard.

Thank you.  I'm really happy to be able to work in Postgres.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Limitate a mirar... y algun dia veras"




Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian dijo:
>
> > Alvaro Herrera wrote:
> > > Hackers:
> > >
> > > I've modified commands/cluster.c so that it recreates the indexes on the
> > > table after clustering the table.  I attach the patch.
>
> > I think Tom was suggesting that you may want to continue work on CLUSTER
> > and make use of relfilenode.  After the cluster, you can just update
> > pg_class.relfilenode with the new file name (random oid generated at
> > build time) and as soon as you commit, all backends will start using the
> > new file and you can delete the old one.
>
> I'm looking at pg_class, and relfilenode is equal to oid in all cases
> AFAICS.  If I create a new, "random" relfilenode, the equality will not
> hold.  Is that OK?  Also, is the new relfilenode somehow guaranteed to

Yes, they are all the same only because we haven't gotten CLUSTER fixed
yet.  ;-)  We knew we needed to allow cases where they are not the same,
specifically for commands like CLUSTER.

> not be assigned to another relation (pg_class tuple, I think)?

It will be fine.  The relfilenode will be the one assigned to the new
heap table and will not be used by others.

>
>
> > In this code, we delete the old relation, then rename the new one.  It
> > would be good to have this all happen in one update of
> > pg_class.relfilenode;  that way it is an atomic operation.
>
> OK, I'll see if I can do that.
>
> > Let us know if you want to pursue that and we can give you additional
> > assistance.  I would like to see CLUSTER really polished up.  More
> > people are using it now and it really needs someone to focus on it.
>
> I thought CLUSTER was meant to be replaced by automatically clustering
> indexes.  Isn't that so?  Is anyone working on that?

We have no idea how to automatically cluster based on an index.  Not
even sure how the commercial guys do it.

> > Glad to see you working on CLUSTER.  Welcome aboard.
>
> Thank you.  I'm really happy to be able to work in Postgres.

Great.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I think you probably want a CommandCounterIncrement at the bottom of the
> loop (after setRelhasindex).  If it works as-is, it's only by chance,
> i.e. due to internal CCI calls in index_create().

Done.


> +         tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
> +                 0, 0, 0);
> +         if (!HeapTupleIsValid(tuple))
> +                 break;
>
> Breaking out of the loop hardly seems an appropriate response to this
> failure condition.  Not finding the index's pg_class entry is definitely
> an error.

Sure.  It's elog(ERROR) now.  I'm not sure what I was thinking when I
wrote that.

> I'd also suggest more-liberal commenting, as well as more attention to
> updating the existing comments to match new reality.

I'm afraid I cannot get too verbose no matter how hard I try.  I hope
this one is OK.

> In general, I'm not thrilled about expending more code on the existing
> fundamentally-broken implementation of CLUSTER.  We need to look at
> making use of the ability to write a new version of a table (or index)
> under a new relfilenode value, without changing the table's OID.
> However, some parts of your patch will probably still be needed when
> someone gets around to making that happen, so I won't object for now.

Will try to do this.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
Licensee shall have no right to use the Licensed Software
for productive or commercial use. (StarOffice 6.0 beta license)

Attachment

Re: CLUSTER not lose indexes

From
Tom Lane
Date:
Alvaro Herrera <alvherre@atentus.com> writes:
> I'm looking at pg_class, and relfilenode is equal to oid in all cases
> AFAICS.  If I create a new, "random" relfilenode, the equality will not
> hold.  Is that OK?

The idea is that OID is the logical table identifier and relfilenode is
the physical identifier (so relfilenode is what should be used in bufmgr
and below).  There are undoubtedly a few places that get this wrong, but
we won't be able to ferret them out until someone starts to exercise
the facility.

> Also, is the new relfilenode somehow guaranteed to
> not be assigned to another relation (pg_class tuple, I think)?

I've been wondering about that myself.  We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.
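
Until such an index exists, about the best a caller could do is re-check by
hand.  A trivial sketch of the idea; RelfilenodeInUse() is hypothetical (it
would have to scan pg_class), and nothing enforces this today:

    /* Keep drawing OIDs until one is not already in use as a relfilenode. */
    Oid         newrfn;

    do
    {
        newrfn = newoid();
    } while (RelfilenodeInUse(newrfn));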

>> In this code, we delete the old relation, then rename the new one.  It
>> would be good to have this all happen in one update of
>> pg_class.relfilenode;  that way it is an atomic operation.

As long as you have not committed, it's atomic anyway because no one can
see your updates.  It'd be nice to do it in one update for efficiency,
but don't contort the code beyond reason to achieve that.

            regards, tom lane



Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Tom Lane wrote:
> > Also, is the new relfilenode somehow guaranteed to
> > not be assigned to another relation (pg_class tuple, I think)?
>
> I've been wondering about that myself.  We might have to add a unique
> index on pg_class.relfilenode to ensure this; otherwise, after OID
> wraparound there would be no guarantees.

Yep, good point.

> >> In this code, we delete the old relation, then rename the new one.  It
> >> would be good to have this all happen in one update of
> >> pg_class.relfilenode;  that way it is an atomic operation.
>
> As long as you have not committed, it's atomic anyway because no one can
> see your updates.  It'd be nice to do it in one update for efficiency,
> but don't contort the code beyond reason to achieve that.

Sorry, I meant to say that we added relfilenode for exactly this case,
so that we have atomic file access semantics.  Do we already have that
feature in the current code?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> > In general, I'm not thrilled about expending more code on the existing
> > fundamentally-broken implementation of CLUSTER.  We need to look at
> > making use of the ability to write a new version of a table (or index)
> > under a new relfilenode value, without changing the table's OID.
> > However, some parts of your patch will probably still be needed when
> > someone gets around to making that happen, so I won't object for now.

OK, I remember now.  In renamerel(), the new clustered file is renamed
to the old table name.  However, it has a new oid.  The idea of
relfilenode was that we want to cluster the table and keep the same
relation oid.  That way, other system tables that reference that OID
will still work.

Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file.  Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seems like renamerel will have to stay because it is used by ALTER TABLE
> RENAME, so we just need some new code that updates the relfilenode of
> the old pg_class row to point to the new clustered file.  Swapping
> relfilenodes between the old and new pg_class rows and deleting the new
> table should do the trick of deleting the non-clustered file and the
> temp pg_class row at the same time.

I think you're still letting your thinking be contorted by the existing
CLUSTER implementation.  Do we need a temp pg_class entry at all?  Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.
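
That no-temp-table variant would look something like this, with the same
caveats as the swap sketch earlier in the thread: era-appropriate catalog
calls, pg_class index maintenance and error checks omitted, and tableOid and
newrelfilenode as placeholders:

    Relation    relRelation;
    HeapTuple   tup;

    relRelation = heap_openr(RelationRelationName, RowExclusiveLock);

    tup = SearchSysCacheCopy(RELOID, ObjectIdGetDatum(tableOid), 0, 0, 0);
    ((Form_pg_class) GETSTRUCT(tup))->relfilenode = newrelfilenode;

    /* Visible to this backend after the next CommandCounterIncrement(),
     * and to everyone else only at commit. */
    simple_heap_update(relRelation, &tup->t_self, tup);

    heap_freetuple(tup);
    heap_close(relRelation, RowExclusiveLock);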

What's still a little unclear to me is how to access the old heap and
index files to read the data while simultaneously accessing the new ones
to write it.  Much of the existing code likes to have a Relation struct
available, not only a RelFileNode, so it may be necessary to have both
old and new Relations present at the same time.  If that's the case we
might be stuck with making a temp pg_class entry just to support a phony
Relation :-(

            regards, tom lane



Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seems like renamerel will have to stay because it is used by ALTER TABLE
> > RENAME, so we just need some new code that updates the relfilenode of
> > the old pg_class row to point to the new clustered file.  Swapping
> > relfilenodes between the old and new pg_class rows and deleting the new
> > table should do the trick of deleting the non-clustered file and the
> > temp pg_class row at the same time.
>
> I think you're still letting your thinking be contorted by the existing
> CLUSTER implementation.  Do we need a temp pg_class entry at all?  Seems
> like we just want to UPDATE the pg_class row with the new relfilenode
> value; then we can see the update but no one else can (till we commit).
> Ditto for the indexes.
>
> What's still a little unclear to me is how to access the old heap and
> index files to read the data while simultaneously accessing the new ones
> to write it.  Much of the existing code likes to have a Relation struct
> available, not only a RelFileNode, so it may be necessary to have both
> old and new Relations present at the same time.  If that's the case we
> might be stuck with making a temp pg_class entry just to support a phony
> Relation :-(

Yes, that was my conclusion, that we need the temp heap so we can access
it in a clean manner.  Sure, it would be nice if we could access a file
on its own, but it doesn't seem worth the complexity required to
accomplish it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Seems like renamerel will have to stay because it is used by ALTER TABLE
> > RENAME, so we just need some new code that updates the relfilenode of
> > the old pg_class row to point to the new clustered file.  Swapping
> > relfilenodes between the old and new pg_class rows and deleting the new
> > table should do the trick of deleting the non-clustered file and the
> > temp pg_class row at the same time.
>
> I think you're still letting your thinking be contorted by the existing
> CLUSTER implementation.  Do we need a temp pg_class entry at all?  Seems
> like we just want to UPDATE the pg_class row with the new relfilenode
> value; then we can see the update but no one else can (till we commit).
> Ditto for the indexes.

That's what I originally thought: mess around directly with the smgr (or
with some upper layer? I don't know) to create a new relfilenode, and
then attach it to the heap.  I don't know if it's possible or too
difficult.

Then, with Bruce's explanation, I thought I should just create a temp
table and exchange relfilenodes, which is much simpler.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"El dia que dejes de cambiar dejaras de vivir"




Re: CLUSTER not lose indexes

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> > I think you're still letting your thinking be contorted by the existing
> > CLUSTER implementation.  Do we need a temp pg_class entry at all?  Seems
> > like we just want to UPDATE the pg_class row with the new relfilenode
> > value; then we can see the update but no one else can (till we commit).
> > Ditto for the indexes.
>
> That's what I originally thought: mess around directly with the smgr (or
> with some upper layer? I don't know) to create a new relfilenode, and
> then attach it to the heap.  I don't know if it's possible or too
> difficult.
>
> Then, with Bruce's explanation, I thought I should just create a temp
> table and exchange relfilenodes, which is much simpler.

Yes, creating the file is easy.  Inserting into a file without a heap
relation pointer is a pain.  ;-)

Actually, I did a little research on the crash/abort handling of
swapping the relfilenodes between the old and clustered pg_class heap
entries.  An abort/crash can happen in several ways: abort of the
transaction, a backend crash during the transaction, or cleanup of the
old file at commit.

First, when you create a table, the relfilenode is _copied_ to
pendingDeletes.  These files are cleaned up if the transaction aborts in
smgrDoPendingDeletes().  There will also be an entry in there for the
same table after you call heap_delete.  As long as the relfilenode at
heap_delete time points to the original heap file, you should be fine.
If it commits, the new file is removed (remember that oid was saved).
If it aborts, the new file is removed (CLUSTER failed).

As far as a crash goes, there is RemoveTempRelations(), which calls
FindTempRelations(), but I think that only gets populated after the
commit happens and the temp namespace is populated.  I assume the
aborted part of the transaction is invisible to the backend at that
point (crash time).  Of course, it wouldn't hurt to test these cases to
make sure it works right, and I would document that the copying of the
relfilenode in smgr create/unlink is a desired effect relied upon by
CLUSTER and any other commands that may use relfilenode renaming in the
future.  (If it weren't copied, then an abort would delete the _old_
heap.  Bad.)

FYI, REINDEX also deals with relfilenode renaming, in setNewRelfilenode().
The difference is that REINDEX doesn't need to access the old index, just
build a new one, so it can take shortcuts in how it handles things.  It
uses two methods to modify the tuple: one that overwrites the pg_class row
in place, rather than inserting a new heap row, and another that does it
the traditional way.  It does this because REINDEX is used to fix corrupt
indexes, including corrupt system indexes.  You will not be using that
type of code in CLUSTER because there is a real temp heap associated with
this operation.  Just use heap_update() as normal for both relations.

Hope this helps.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: CLUSTER not lose indexes

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> FYI, REINDEX also deals with relfilenode renaming, in setNewRelfilenode().
> The difference is that REINDEX doesn't need to access the old index, just
> build a new one, so it can take shortcuts in how it handles things.  It
> uses two methods to modify the tuple: one that overwrites the pg_class row
> in place, rather than inserting a new heap row, and another that does it
> the traditional way.  It does this because REINDEX is used to fix corrupt
> indexes, including corrupt system indexes.  You will not be using that
> type of code in CLUSTER because there is a real temp heap associated with
> this operation.  Just use heap_update() as normal for both relations.

Well, I think my approach is somewhat more naive.  What I'm actually
doing is something like this (steps 5.1-5.3 are sketched in code after
the list):

1.  save information on extant indexes
2.  create a new heap, and populate (clustered)
3.  swap the relfilenodes of the new and old heaps
4.  drop new heap (with old relfilenode)
5.  for each index saved in 1:
5.1.  create a new index with the same attributes
5.2.  swap relfilenodes of original and new index
5.3.  drop new index (with old relfilenode)
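
In code, steps 5.1-5.3 for a single saved index come out roughly as below.
recreate_index_from_saved_info() is a hypothetical wrapper around
index_create(), and swap_relfilenodes() is the hypothetical helper sketched
earlier in the thread:

    /* Steps 5.1-5.3 for one saved index (sketch; helper names hypothetical). */
    Oid         newIndexOid;

    /* 5.1: build a throwaway index with the same definition */
    newIndexOid = recreate_index_from_saved_info(attrs);
    CommandCounterIncrement();

    /* 5.2: exchange the physical files of the original and the new index */
    swap_relfilenodes(attrs->indexOID, newIndexOid);
    CommandCounterIncrement();

    /* 5.3: drop the throwaway index, which now points at the old file */
    index_drop(newIndexOid);
    CommandCounterIncrement();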

But now I'm lost.  It has worked sometimes; then I change something
minimal, recompile, and it no longer works (bufmgr fails an assertion).
I can tell that the new (table) filenode is correct if I skip step 4
above and check the temp table manually (but of course it has no
indexes).  I've never gotten as far as completing all the steps right,
so I cannot tell whether the new indexes' filenodes are correct.

I'm posting the new patch.  Please review it; I've turned it upside down
a few times and frankly don't know what's happening.  I'm surely
forgetting a lock or something, but I cannot find what it is.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Aprender sin pensar es inutil; pensar sin aprender, peligroso" (Confucio)

Attachment