"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> a shared cache for system catalog tuples, which might be a win but I'm
>> not sure (I'm worried about contention for the cache, especially if it's
>> protected by just one or a few spinlocks). Anyway, if we did have one
>> then keeping an accurate block count in the relation's pg_class row
>> would be a practical alternative.
> But there would be a problem if we use shared catalog cache.
> System tuples that are being updated are visible only to the updating
> backend, and other backends should see the committed tuples.
> On the other hand, an accurate block count should be visible to all
> backends.
> Which version of a row should we load into the catalog cache and update?
Good point --- rolling back a transaction would cancel changes to the
pg_class row, but it mustn't cause the relation's file to get truncated
(since there could be tuples of other uncommitted transactions in the
newly added block(s)).
This says that having a block count column in pg_class is the Wrong
Thing; we should get rid of relpages entirely. The Right Thing is a
separate data structure in shared memory that stores the current
physical block count for each active relation. The first backend to
touch a given relation would insert an entry, and then subsequent
extensions/truncations/deletions would need to update it. We already
obtain a special lock when extending a relation, so it seems like there'd
be no extra locking cost to have a table like this.
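To make the idea concrete, here is a rough sketch of what such a table
might look like. It is not existing code: every name in it (RelSizeTable,
relsize_enter, and so on) is made up for illustration, the fixed-size array
stands in for a real shared-memory hash table, and locking is left out on
the assumption that callers already hold the relation-extension lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RELSIZE_MAX_ENTRIES 64      /* hypothetical capacity */

typedef struct
{
    uint32_t relid;     /* relation identifier; 0 marks a free slot */
    uint32_t nblocks;   /* current physical block count */
} RelSizeEntry;

/* In a real backend this would live in shared memory (assumption). */
typedef struct
{
    RelSizeEntry entries[RELSIZE_MAX_ENTRIES];
} RelSizeTable;

/* Return the entry for relid, or NULL if no backend has entered it yet. */
RelSizeEntry *
relsize_find(RelSizeTable *t, uint32_t relid)
{
    assert(relid != 0);
    for (size_t i = 0; i < RELSIZE_MAX_ENTRIES; i++)
        if (t->entries[i].relid == relid)
            return &t->entries[i];
    return NULL;
}

/*
 * The first backend to touch a relation inserts an entry, seeded with the
 * block count it measured on disk. Later callers get the existing entry
 * and their measured value is ignored. Returns NULL if the table is full.
 */
RelSizeEntry *
relsize_enter(RelSizeTable *t, uint32_t relid, uint32_t nblocks_on_disk)
{
    RelSizeEntry *e = relsize_find(t, relid);

    if (e != NULL)
        return e;
    for (size_t i = 0; i < RELSIZE_MAX_ENTRIES; i++)
    {
        if (t->entries[i].relid == 0)
        {
            t->entries[i].relid = relid;
            t->entries[i].nblocks = nblocks_on_disk;
            return &t->entries[i];
        }
    }
    return NULL;
}

/* Extension: the count changes at once, independent of commit or abort. */
void
relsize_extend(RelSizeEntry *e, uint32_t added_blocks)
{
    e->nblocks += added_blocks;
}

/* Truncation: the new length may never exceed the recorded length. */
void
relsize_truncate(RelSizeEntry *e, uint32_t new_nblocks)
{
    assert(new_nblocks <= e->nblocks);
    e->nblocks = new_nblocks;
}

/* Deletion of the relation: free the slot for reuse. */
void
relsize_forget(RelSizeTable *t, uint32_t relid)
{
    RelSizeEntry *e = relsize_find(t, relid);

    if (e != NULL)
    {
        e->relid = 0;
        e->nblocks = 0;
    }
}
```

The point of the sketch is that the count is not transactional: a rollback
never touches it, so blocks added by an aborted transaction stay counted.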
Anyone up for actually implementing this ;-) ? I have other things
I want to work on...
>> Well, it seems to me that the first misbehavior (incomplete delete becomes
>> a partial truncate, and you can try again) is a lot better than the
>> second (incomplete delete leaves an undeletable, unrecreatable table).
>> Should I go ahead and make delete/truncate work back-to-front, or do you
>> see a reason why that'd be a bad thing to do?
> I also think back-to-front is better.
OK, I have a couple other little things I want to do in md.c, so I'll
see what I can do about that. Even with a shared-memory relation
length table, back-to-front truncation would be the safest way to
proceed, so we'll want to make this change in any case.
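For illustration only, the back-to-front ordering could look roughly like
the sketch below. It assumes a relation stored as segment files numbered
from 0, and it is not the actual md.c code: truncate_back_to_front, the
callback type, and fail_on_segment are all hypothetical names. The removal
step is passed in as a callback so the ordering can be shown without
touching the filesystem.

```c
#include <assert.h>

/* Hypothetical callback: remove one segment, return 0 on success. */
typedef int (*remove_segment_fn) (unsigned segno, void *arg);

/*
 * Remove segments from the highest-numbered one down to keep_segs,
 * stopping at the first failure. Returns the number of segments that
 * remain. Because removal proceeds from the end, the survivors are always
 * the contiguous range 0 .. result-1, so an incomplete delete looks like
 * a partial truncate and can simply be retried.
 */
unsigned
truncate_back_to_front(unsigned nsegs, unsigned keep_segs,
                       remove_segment_fn remove_seg, void *arg)
{
    assert(keep_segs <= nsegs);
    while (nsegs > keep_segs)
    {
        if (remove_seg(nsegs - 1, arg) != 0)
            break;              /* leave the lower segments intact */
        nsegs--;
    }
    return nsegs;
}

/* Demo callback: pretend that removing segment *(unsigned *) arg fails. */
int
fail_on_segment(unsigned segno, void *arg)
{
    unsigned bad = *(const unsigned *) arg;

    return (segno == bad) ? -1 : 0;
}
```

With five segments and a simulated failure on segment 2, the function
removes segments 4 and 3 and reports three remaining (0, 1, 2), which is
still a valid, shorter relation.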
> Deletion is necessary only to avoid consuming disk space.
>
> For example, vacuum could remove files that were not deleted.
Hmm ... interesting idea ... but I can hear the complaints
from users already...
regards, tom lane