Thread: Block write statistics WIP

Block write statistics WIP

From
Greg Smith
Date:
I have some time now for working on the mystery of why there are no
block write statistics in the database.  I hacked out the statistics
collection and reporting side already, rough patch attached if you want
to see the boring parts.

I guessed that there had to be a gotcha behind why this wasn't done
before now, and I've found one so far.  All of the read statistics are
collected with a macro that expects to know a Relation number.  Callers
to ReadBuffer pass one.  On the write side, the two places that
increment the existing write counters (pg_stat_statements, vacuum) are
MarkBufferDirty and MarkBufferDirtyHint.  Neither of those is given a
Relation though.  Inspecting the Buffer passed can only find the buffer
tag's RelFileNode.

I've thought of two paths to get a block write count out of that so far:

-Provide a function to find the Relation from the RelFileNode.  There is
a warning about the perils of assuming you can map that way from a
buftag value in buf_internals.h though:

"Note: the BufferTag data must be sufficient to determine where to write
the block, without reference to pg_class or pg_tablespace entries.  It's
possible that the backend flushing the buffer doesn't even believe the
relation is visible yet (its xact may have started before the xact that
created the rel).  The storage manager must be able to cope anyway."

-Modify every caller of MarkDirty* to include a relation when that
information is available.  There are about 200 callers of those
functions around, so that won't be fun.  I noted that many of them are
in the XLog replay functions, which won't have the relation--but those
aren't important to count anyway.

Neither of these options feels very good to me, so selecting between the
two feels like picking the lesser of two evils.  I'm fine with chugging
through all of the callers to modify MarkDirty*, but I didn't want to do
that only to have the whole approach rejected as wrong.  That's why I
stopped here to get some feedback.

The way that MarkDirty requires this specific underlying storage manager
behavior to work properly strikes me as as a bit of a layering violation
too.  I'd like the read and write paths to have a similar API, but here
they don't even operate on the same type of inputs.  Addressing that is
probably harder than just throwing a hack on the existing code though.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment

Re: Block write statistics WIP

From
Heikki Linnakangas
Date:
On 19.05.2013 11:15, Greg Smith wrote:
> I've thought of two paths to get a block write count out of that so far:
>
> -Provide a function to find the Relation from the RelFileNode. There is
> a warning about the perils of assuming you can map that way from a
> buftag value in buf_internals.h though:
>
> "Note: the BufferTag data must be sufficient to determine where to write
> the block, without reference to pg_class or pg_tablespace entries. It's
> possible that the backend flushing the buffer doesn't even believe the
> relation is visible yet (its xact may have started before the xact that
> created the rel). The storage manager must be able to cope anyway."
>
> -Modify every caller of MarkDirty* to include a relation when that
> information is available. There are about 200 callers of those functions
> around, so that won't be fun. I noted that many of them are in the XLog
> replay functions, which won't have the relation--but those aren't
> important to count anyway.
>
> Neither of these options feels very good to me, so selecting between the
> two feels like picking the lesser of two evils. I'm fine with chugging
> through all of the callers to modify MarkDirty*, but I didn't want to do
> that only to have the whole approach rejected as wrong. That's why I
> stopped here to get some feedback.

Adding a Relation argument to all the Mark* calls seems fine to me. I 
don't find it ugly at all. The biggest objection would be that if there 
are extensions out there that call those functions, they would need to 
be changed, but I think that's fine.

> The way that MarkDirty requires this specific underlying storage manager
> behavior to work properly strikes me as as a bit of a layering violation
> too. I'd like the read and write paths to have a similar API, but here
> they don't even operate on the same type of inputs. Addressing that is
> probably harder than just throwing a hack on the existing code though.

To be honest, I don't understand what you mean by that. ?

- Heikki



Re: Block write statistics WIP

From
Greg Smith
Date:
On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>> The way that MarkDirty requires this specific underlying storage manager
>> behavior to work properly strikes me as as a bit of a layering violation
>> too. I'd like the read and write paths to have a similar API, but here
>> they don't even operate on the same type of inputs. Addressing that is
>> probably harder than just throwing a hack on the existing code though.
>
> To be honest, I don't understand what you mean by that. ?

Let's say you were designing a storage layer API from scratch for what 
Postgres does.  That might take a relation as its input, like ReadBuffer 
does.  Hiding the details of how that turns into a physical file 
operation would be a useful goal of such a layer.  I'd then consider it 
a problem if that exposed things like the actual mapping of relations 
into files to callers.

What we actually have right now is this MarkDirty function that operates 
on BufferTag data, which points directly to the underlying file.  I see 
that as cutting the storage API in half and calling a function in the 
middle of the implementation.  It strikes me as kind of weird that the 
read side and write side are not more symmetrical.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Block write statistics WIP

From
Heikki Linnakangas
Date:
On 23.05.2013 19:10, Greg Smith wrote:
> On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>>> The way that MarkDirty requires this specific underlying storage
>>> manager behavior to work properly strikes me as as a bit of a
>>> layering violation too. I'd like the read and write paths to have
>>> a similar API, but here they don't even operate on the same type
>>> of inputs. Addressing that is probably harder than just throwing
>>> a hack on the existing code though.
>>
>> To be honest, I don't understand what you mean by that. ?
>
> Let's say you were designing a storage layer API from scratch for
> what Postgres does. That might take a relation as its input, like
> ReadBuffer does. Hiding the details of how that turns into a physical
> file operation would be a useful goal of such a layer. I'd then
> consider it a problem if that exposed things like the actual mapping
> of relations into files to callers.

Ok, got it.

> What we actually have right now is this MarkDirty function that
> operates on BufferTag data, which points directly to the underlying
> file. I see that as cutting the storage API in half and calling a
> function in the middle of the implementation.

Well, no, the BufferTag struct is internal to the buffer manager 
implementation. It's not part of the API; it's an implementation detail 
of the buffer manager.

> It strikes me as kind of weird that the read side and write side are
> not more symmetrical.

It might seem weird if you expect the API to be similar to POSIX read() 
and write(), where you can read() an arbitrary block at any location, 
and write() an arbitrary block at any location. A better comparison 
would be e.g open() and close(). open() returns a file descriptor, which 
you pass to other functions to operate on the file. When you're done, 
you call close(fd). The file descriptor is completely opaque to the user 
program, you do all the operations through the functions that take the 
fd as argument. Similarly, ReadBuffer() returns a Buffer, which is 
completely opaque to the caller, and you do all the operations through 
various functions and macros that operate on the Buffer. When you're 
done, you release the buffer with ReleaseBuffer().

(sorry for the digression from the original topic, I don't have any 
problem with what adding an optional Relation argument to MarkBuffer if 
you need that :-) )

- Heikki



Re: Block write statistics WIP

From
Satoshi Nagayasu
Date:
Hi,

I'm looking into this patch as a reviewer.

(2013/05/24 10:33), Heikki Linnakangas wrote:
> On 23.05.2013 19:10, Greg Smith wrote:
>> On 5/20/13 7:51 AM, Heikki Linnakangas wrote:
>> It strikes me as kind of weird that the read side and write side are
>> not more symmetrical.
>
> It might seem weird if you expect the API to be similar to POSIX read()
> and write(), where you can read() an arbitrary block at any location,
> and write() an arbitrary block at any location. A better comparison
> would be e.g open() and close(). open() returns a file descriptor, which
> you pass to other functions to operate on the file. When you're done,
> you call close(fd). The file descriptor is completely opaque to the user
> program, you do all the operations through the functions that take the
> fd as argument. Similarly, ReadBuffer() returns a Buffer, which is
> completely opaque to the caller, and you do all the operations through
> various functions and macros that operate on the Buffer. When you're
> done, you release the buffer with ReleaseBuffer().
>
> (sorry for the digression from the original topic, I don't have any
> problem with what adding an optional Relation argument to MarkBuffer if
> you need that :-) )

Or should we add some pointer, which is accociated with the Relation,
into BufferDesc? Maybe OID?

I'm thinking of FlushBuffer() too. How can we count physical write
for each relation in FlushBuffer()? Or any other appropriate place?

BTW, Greg's first patch could not be applied to the latest HEAD.
I guess it need to have some fix, I couldn't clafiry it though.

Regards,
-- 
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp



Re: Block write statistics WIP

From
Greg Smith
Date:
On 7/1/13 3:10 AM, Satoshi Nagayasu wrote:
> Or should we add some pointer, which is accociated with the Relation,
> into BufferDesc? Maybe OID?

That is the other option here, I looked at it but didn't like it.  The 
problem is that at the point when a new page is created, it's not always 
clear yet what relation it's going to hold.  That means that if the 
buffer page itself is where you look up the relation OID, every code 
path that manipulates relation IDs will need to worry about maintaining 
this information.  It's possible to do it that way, but I don't know 
that it's any less work than making all the write paths keep track of 
it.  It would need some extra locking around updating the relation tag 
in the buffer pages too, and I'd prefer not to

The other thing that I like about the writing side is that I can 
guarantee the code is correct pretty easily.  Yes, it's going to take 
days worth of modifying writing code.  But the compile will force you to 
fix all of them, and once they're all updated I'd be surprised if 
something unexpected happened.

If instead the data moves onto the buffer page header instead, we could 
be chasing bugs similar to the relcache ones forever.  Also, as a tie 
breaker, buffer pages will get bigger and require more locking this way.  Making writers tell us the relation doesn't
needany of that.
 

> I'm thinking of FlushBuffer() too. How can we count physical write
> for each relation in FlushBuffer()? Or any other appropriate place?

SMgrRelation is available there, so tagging the relation should be easy 
in this case.

> BTW, Greg's first patch could not be applied to the latest HEAD.
> I guess it need to have some fix, I couldn't clafiry it though.

Since this is a WIP patch, what I was looking for was general design 
approach feedback, with my bit as a simple example.  I didn't expect 
anyone to spend much time doing a formal review of that quick hack. 
You're welcome to try and play with that code to add things, I'm not 
very attached to it yet.  I've basically gotten what I wanted to here 
from this submission; I'm marking it returned with feedback.  If anyone 
else has comments, I'm still open to discussion here too.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com