Re: CLUSTER not lose indexes - Mailing list pgsql-patches

From Bruce Momjian
Subject Re: CLUSTER not lose indexes
Date
Msg-id 200207072329.g67NTUC21748@candle.pha.pa.us
Whole thread Raw
In response to Re: CLUSTER not lose indexes  (Alvaro Herrera <alvherre@atentus.com>)
Responses Re: CLUSTER not lose indexes
List pgsql-patches
Alvaro Herrera wrote:
> > I think you're still letting your thinking be contorted by the existing
> > CLUSTER implementation.  Do we need a temp pg_class entry at all?  Seems
> > like we just want to UPDATE the pg_class row with the new relfilenode
> > value; then we can see the update but no one else can (till we commit).
> > Ditto for the indexes.
>
> That's what I originally thought: mess around directly with the smgr (or
> with some upper layer? I don't know) to create a new relfilenode, and
> then attach it to the heap.  I don't know if it's possible or too
> difficult.
>
> Then, with Bruce's explanation, I thought I should just create a temp
> table and exchange relfilenodes, which is much simpler.

Yes, creating the file is easy.  Inserting into a file without a heap
relation pointer is a pain.  ;-)

Actually, I did a little research on the crash/abort handling of
swapping the relfilenodes between the old and clustered pg_class heap
entries.  A abort/crash can happen in several ways-  abort of the
transaction, backend crash during the transaction, or cleaning up of old
file at commit.

First, when you create a table, the relfilenode is _copied_ to
pendingDeletes.  These files are cleaned up if the transaction aborts in
smgrDoPendingDeletes().  There will also be an entry in there for the
same table after you call heap_delete.  As long as the relfilenode at
heap_delete time points to the original heap file, you should be fine.
If it commits, the new file is removed (remember that oid was saved).
If it aborts, the new file is removed (CLUSTER failed).

As far as a crash, there is RemoveTempRelations which calls
FindTempRelations() but I think that only gets populated after the
commit happens and the temp namespaces is populated.  I assume the
aborted part of the transaction is invisible to the backend at that
point (crash time).  Of course, it wouldn't hurt to test these cases to
make sure it works right, and I would document that the copying of
relfilenode in smgr create/unlink is a desired effect relied upon by
cluster and any other commands that may use relfilenode renaming in the
future.  (If it wasn't copied, then an abort would delete the _old_
heap.  Bad.)

FYI, RENAME also deals with relfilenode renaming in setNewRelfilenode().
The difference is that RENAME doesn't need to access the old index, just
build a new one, so it can take shortcuts in how it handles things. It
uses two methods to modify the tuple, one directly modifying pg_class,
rather than inserting a new heap row. and another of doing it the
traditional way.  It does this because REINDEX is used to fix corrupt
indexes, including corrupt system indexes.  You will not be using that
type of code in CLUSTER because there is a real temp heap associated
with this operation.  Just heap_update() like normal for both relations.

Hope this helps.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



pgsql-patches by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: CLUSTER not lose indexes
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Non-standard feature request