Re: mdnblocks() sabotages error checking in _mdfd_getseg() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: mdnblocks() sabotages error checking in _mdfd_getseg()
Msg-id CA+TgmoZY0U+XCMzs+iBw8PnrNi7E4+uD4Fnxbr9YmFk+P-KFYA@mail.gmail.com
In response to Re: mdnblocks() sabotages error checking in _mdfd_getseg()  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: mdnblocks() sabotages error checking in _mdfd_getseg()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 December 2015 at 16:47, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de>
>> wrote:
>> >> In fact, having no way to get the relation length other than scanning
>> >> 1000 files doesn't seem like an especially good choice even if we used
>> >> a better data structure.  Putting a header page in the heap would make
>> >> getting the length of a relation O(1) instead of O(segments), and for
>> >> a bonus, we'd be able to reliably detect it if a relation file
>> >> disappeared out from under us.  That's a difficult project and
>> >> definitely not my top priority, but this code is old and crufty all
>> >> the same.)
>> >
>> > The md layer doesn't really know whether it's dealing with a heap, or
>> > with an index, or ... So handling this via a metapage doesn't seem
>> > particularly straightforward.
>>
>> It's not straightforward, but I don't think that's the reason.  What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead.  I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> It looks pretty straightforward to me...
>
> The number of relations with >1 file is likely to be fairly small, so we can
> just have an in-memory array to record that. 8 bytes per relation >1 GB
> isn't going to take much shmem, but we can extend using dynshmem as needed.
> We can seq scan the array at relcache build time and invalidate relcache
> when we extend. WAL log any extension to a new segment and write the table
> to disk at checkpoint.
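
For illustration, the array described in the quoted proposal might look roughly like the sketch below. The entry layout (a 4-byte relation identifier plus a 4-byte segment count, making up the 8 bytes mentioned) and the names MultiSegEntry and lookup_nsegments are assumptions made for this example; they are not taken from the proposal or from PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte entry: one per relation that spans more than one file. */
typedef struct
{
    uint32_t relid;      /* assumed relation identifier */
    uint32_t nsegments;  /* number of segment files the relation occupies */
} MultiSegEntry;

/*
 * Sequentially scan the array, as would happen at relcache build time.
 * A relation that is not listed is assumed to have a single segment.
 */
static uint32_t
lookup_nsegments(const MultiSegEntry *tab, size_t nentries, uint32_t relid)
{
    assert(tab != NULL || nentries == 0);

    for (size_t i = 0; i < nentries; i++)
    {
        if (tab[i].relid == relid)
            return tab[i].nsegments;
    }
    return 1;
}
```

The sketch covers only the lookup side; keeping the array current when a relation extends into a new segment is the part discussed in the reply below.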

Invalidating the relcache when we extend would be extremely expensive,
but we could probably come up with some variant of this that would
work.  I'm not very excited about this design, though; I think
actually putting a metapage on each relation would be better.
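
To make the O(segments) versus O(1) contrast concrete, here is a minimal in-memory sketch. It is not the md.c code: SegmentedRel, MetaPage and both functions are hypothetical stand-ins, and RELSEG_SIZE is set to the value for a default build (1 GB segments of 8 kB blocks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blocks per segment file: 1 GB / 8 kB, assuming a default build. */
#define RELSEG_SIZE 131072u

/* Hypothetical stand-in for a relation's segment files on disk. */
typedef struct
{
    const uint32_t *seg_blocks;  /* number of blocks in each segment */
    size_t          nsegs;       /* number of segment files present */
} SegmentedRel;

/* Hypothetical metapage that records the relation length in one place. */
typedef struct
{
    uint64_t nblocks;
} MetaPage;

/*
 * O(segments): probe each segment in order until a partial one is found.
 * *probes counts the segments touched, standing in for system calls.
 */
static uint64_t
nblocks_by_scan(const SegmentedRel *rel, size_t *probes)
{
    uint64_t total = 0;

    assert(rel != NULL && probes != NULL);
    *probes = 0;

    for (size_t i = 0; i < rel->nsegs; i++)
    {
        (*probes)++;
        total += rel->seg_blocks[i];
        if (rel->seg_blocks[i] < RELSEG_SIZE)
            break;              /* a partial segment is the last one */
    }
    return total;
}

/* O(1): a single read, no matter how many segments exist. */
static uint64_t
nblocks_by_metapage(const MetaPage *meta)
{
    assert(meta != NULL);
    return meta->nblocks;
}
```

With 1000 full segments the first function touches every one of them, while the second does the same amount of work as for a one-segment relation. The sketch leaves out the hard part, which is keeping the metapage correct as the relation is extended.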

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


