Re: CLUSTER not lose indexes - Mailing list pgsql-patches
From | Bruce Momjian |
---|---|
Subject | Re: CLUSTER not lose indexes |
Date | |
Msg-id | 200207072329.g67NTUC21748@candle.pha.pa.us Whole thread Raw |
In response to | Re: CLUSTER not lose indexes (Alvaro Herrera <alvherre@atentus.com>) |
Responses |
Re: CLUSTER not lose indexes
|
List | pgsql-patches |
Alvaro Herrera wrote: > > I think you're still letting your thinking be contorted by the existing > > CLUSTER implementation. Do we need a temp pg_class entry at all? Seems > > like we just want to UPDATE the pg_class row with the new relfilenode > > value; then we can see the update but no one else can (till we commit). > > Ditto for the indexes. > > That's what I originally thought: mess around directly with the smgr (or > with some upper layer? I don't know) to create a new relfilenode, and > then attach it to the heap. I don't know if it's possible or too > difficult. > > Then, with Bruce's explanation, I thought I should just create a temp > table and exchange relfilenodes, which is much simpler. Yes, creating the file is easy. Inserting into a file without a heap relation pointer is a pain. ;-) Actually, I did a little research on the crash/abort handling of swapping the relfilenodes between the old and clustered pg_class heap entries. A abort/crash can happen in several ways- abort of the transaction, backend crash during the transaction, or cleaning up of old file at commit. First, when you create a table, the relfilenode is _copied_ to pendingDeletes. These files are cleaned up if the transaction aborts in smgrDoPendingDeletes(). There will also be an entry in there for the same table after you call heap_delete. As long as the relfilenode at heap_delete time points to the original heap file, you should be fine. If it commits, the new file is removed (remember that oid was saved). If it aborts, the new file is removed (CLUSTER failed). As far as a crash, there is RemoveTempRelations which calls FindTempRelations() but I think that only gets populated after the commit happens and the temp namespaces is populated. I assume the aborted part of the transaction is invisible to the backend at that point (crash time). Of course, it wouldn't hurt to test these cases to make sure it works right, and I would document that the copying of relfilenode in smgr create/unlink is a desired effect relied upon by cluster and any other commands that may use relfilenode renaming in the future. (If it wasn't copied, then an abort would delete the _old_ heap. Bad.) FYI, RENAME also deals with relfilenode renaming in setNewRelfilenode(). The difference is that RENAME doesn't need to access the old index, just build a new one, so it can take shortcuts in how it handles things. It uses two methods to modify the tuple, one directly modifying pg_class, rather than inserting a new heap row. and another of doing it the traditional way. It does this because REINDEX is used to fix corrupt indexes, including corrupt system indexes. You will not be using that type of code in CLUSTER because there is a real temp heap associated with this operation. Just heap_update() like normal for both relations. Hope this helps. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
pgsql-patches by date: