Re: [HACKERS] mdnblocks is an amazing time sink in huge relations - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
Msg-id 1689.940302654@sss.pgh.pa.us
In response to RE: [HACKERS] mdnblocks is an amazing time sink in huge relations  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
Responses RE: [HACKERS] mdnblocks is an amazing time sink in huge relations  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List pgsql-hackers
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> a shared cache for system catalog tuples, which might be a win but I'm
>> not sure (I'm worried about contention for the cache, especially if it's
>> protected by just one or a few spinlocks).  Anyway, if we did have one
>> then keeping an accurate block count in the relation's pg_class row
>> would be a practical alternative.

> But there would be a problem if we use a shared catalog cache.
> System tuples that are being updated are visible only to the updating
> backend; other backends should see the committed tuples.
> On the other hand, an accurate block count should be visible to all
> backends.
> Which version of a row's tuple should we load into the catalog cache
> and update?

Good point --- rolling back a transaction would cancel changes to the
pg_class row, but it mustn't cause the relation's file to get truncated
(since there could be tuples of other uncommitted transactions in the
newly added block(s)).

This says that having a block count column in pg_class is the Wrong
Thing; we should get rid of relpages entirely.  The Right Thing is a
separate data structure in shared memory that stores the current
physical block count for each active relation.  The first backend to
touch a given relation would insert an entry, and then subsequent
extensions/truncations/deletions would need to update it.  We already
obtain a special lock when extending a relation, so seems like there'd
be no extra locking cost to have a table like this.
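
To make the proposal above concrete, here is a minimal sketch of such a
table, modelled in Python rather than backend C for brevity. Everything in
it is an assumption made for illustration: the class name RelLengthTable,
its method names, and the count_blocks_on_disk callback (standing in for
the expensive mdnblocks-style lookup) are hypothetical, and a single
threading.Lock stands in for the lock already taken when a relation is
extended. It is not how any PostgreSQL release implements this.

```python
import threading


class RelLengthTable:
    """Hypothetical model of a shared table of physical block counts."""

    def __init__(self, count_blocks_on_disk):
        # count_blocks_on_disk(rel_id) -> int is the slow path: it asks
        # the storage layer how many blocks the relation really has.
        self._count_blocks_on_disk = count_blocks_on_disk
        self._nblocks = {}
        # One lock for the whole table; an assumption made to keep the
        # sketch short. It stands in for the relation-extension lock.
        self._lock = threading.Lock()

    def _ensure_entry(self, rel_id):
        # Caller must hold self._lock. The first toucher inserts the entry.
        if rel_id not in self._nblocks:
            self._nblocks[rel_id] = self._count_blocks_on_disk(rel_id)

    def nblocks(self, rel_id):
        """Return the current physical block count without touching disk
        once the entry exists."""
        with self._lock:
            self._ensure_entry(rel_id)
            return self._nblocks[rel_id]

    def extend(self, rel_id, added=1):
        """Record that the relation grew by `added` blocks."""
        with self._lock:
            self._ensure_entry(rel_id)
            self._nblocks[rel_id] += added
            return self._nblocks[rel_id]

    def truncate(self, rel_id, new_nblocks):
        """Record that the relation was physically cut to new_nblocks."""
        with self._lock:
            self._nblocks[rel_id] = new_nblocks

    def forget(self, rel_id):
        """Drop the entry, for example when the relation is deleted."""
        with self._lock:
            self._nblocks.pop(rel_id, None)
```

Note that nothing here is tied to transaction state: a rollback never
calls truncate(), which matches the requirement that aborting a
transaction must not shrink the file underneath other backends.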

Anyone up for actually implementing this ;-) ?  I have other things
I want to work on...

>> Well, it seems to me that the first misbehavior (incomplete delete becomes
>> a partial truncate, and you can try again) is a lot better than the
>> second (incomplete delete leaves an undeletable, unrecreatable table).
>> Should I go ahead and make delete/truncate work back-to-front, or do you
>> see a reason why that'd be a bad thing to do?

> I also think back-to-front is better.

OK, I have a couple other little things I want to do in md.c, so I'll
see what I can do about that.  Even with a shared-memory relation
length table, back-to-front truncation would be the safest way to
proceed, so we'll want to make this change in any case.
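
The back-to-front ordering can be sketched as follows, again in Python as
a stand-in for the real code in md.c. The helper names (segment_path,
list_segments, unlink_back_to_front) are hypothetical, and the file naming
(first segment "base", later ones "base.1", "base.2", ...) is assumed for
the sketch. The point it illustrates is that removing the highest-numbered
segment first means a failure partway through leaves a shorter but still
contiguous relation, so the operation can simply be retried.

```python
import os


def segment_path(base, segno):
    # Assumed naming: segment 0 is the bare file name, later segments
    # carry a numeric suffix.
    return base if segno == 0 else f"{base}.{segno}"


def list_segments(base):
    """Return segment numbers 0..n-1 for the contiguous run of files
    that exist, stopping at the first gap."""
    segno = 0
    while os.path.exists(segment_path(base, segno)):
        segno += 1
    return list(range(segno))


def unlink_back_to_front(base, remove=os.remove):
    """Remove all segments, highest-numbered first.

    If `remove` raises partway through, segments 0..k are still present
    with no gap, so list_segments() still finds every remaining file.
    """
    for segno in reversed(list_segments(base)):
        remove(segment_path(base, segno))
```

Removing front-to-back instead would delete segment 0 first; after a
failure, a scan that stops at the first missing file would no longer see
the leftover higher-numbered segments, which is the "undeletable,
unrecreatable table" case described above.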

> Deletion is necessary only to avoid consuming disk space.
>
> For example, vacuum could remove files that were not deleted.

Hmm ... interesting idea ... but I can hear the complaints
from users already...
        regards, tom lane

