Thread: Multiple Storage per Tablespace, or Volumes

Multiple Storage per Tablespace, or Volumes

From
Dimitri Fontaine
Date:
Hi list,

Here's a proposal for an idea which stole a good part of my night.
I'll present the idea first, then two use cases with some rationale and a
few details. Please note I won't be able to participate in any development
effort associated with this idea, should such a thing happen!

The bare idea is to provide a way to 'attach' multiple storage facilities (say
volumes) to a given tablespace. Each volume may be attached in READ ONLY,
READ WRITE or WRITE ONLY mode.
You can mix RW and WO volumes into the same tablespace, but can't have RO with
any W form, or so I think.

It would be pretty handy to be able to add and remove volumes on a live
cluster, and this could be a way to implement moving/extending tablespaces.
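
To make it concrete, here is the kind of syntax I have in mind -- purely
hypothetical, none of these commands exist today and the paths are made up:

  CREATE TABLESPACE fastspace LOCATION '/mnt/raid/pgdata';
  ALTER TABLESPACE fastspace ADD VOLUME '/mnt/ramdisk/pgdata' READ WRITE;
  ALTER TABLESPACE fastspace ADD VOLUME '/mnt/slow_mirror/pgdata' WRITE ONLY;
  ALTER TABLESPACE fastspace DROP VOLUME '/mnt/ramdisk/pgdata';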


Use Case A: better read performance while keeping data write reliability

The first application of this multiple volumes per tablespace idea is to keep
a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW).

PG should then be able to read from either volume when serving read
queries, but would have to fwrite()/fsync() both volumes for each write.
Of course, write speed will be constrained by the slowest volume, but the
quicker one would then be able to absorb some amount of the read queries
in the meantime.

It would be neat if PG were able to account for each volume's relative write
speed in order to assign a weight to each tablespace volume, and have the
planner or executor spread read queries among volumes depending on that.
For example, if a single query has a plan containing several full scans (of
indexes and/or tables) in the same tablespace, those could be done on
different volumes.

Use Case B: Synchronous Master Slave(s) Replication

By using a Distributed File System capable of being mounted from several nodes
at the same time, we could have a configuration where a master node has
('exports') a WO tablespace volume, and one or more slaves (depending on FS
capability) configures a RO tablespace volume.

PG would then have to be able to cope with a RO volume: the data is not written
by PG itself (from the local node's point of view), so some limitations would
certainly apply.
Would it be possible, for example, to add indexes to the data on slaves?
I'd use the solution even without that, though...

When the master/slave link is broken, the master can no longer write to the
tablespace, as if it were a local disk failure of some sort, so this should
prevent nasty desync problems: data is written to all W volumes or it is
not written at all.


I realize this proposal is only a first draft of work to be done, and that I
won't be able to do much more than draft the idea. This mail is sent
to the hackers list in the hope that someone there will find it worth
considering and polishing...

Regards, and thanks for the good work ;)
--
Dimitri Fontaine

Re: Multiple Storage per Tablespace, or Volumes

From
Martijn van Oosterhout
Date:
On Mon, Feb 19, 2007 at 11:25:41AM +0100, Dimitri Fontaine wrote:
> Hi list,
>
> Here's a proposal for an idea which stole a good part of my night.
> I'll present the idea first, then two use cases with some rationale and a
> few details. Please note I won't be able to participate in any development
> effort associated with this idea, should such a thing happen!
>
> The bare idea is to provide a way to 'attach' multiple storage facilities (say
> volumes) to a given tablespace. Each volume may be attached in READ ONLY,
> READ WRITE or WRITE ONLY mode.
> You can mix RW and WO volumes into the same tablespace, but can't have RO with
> any W form, or so I think.

Somehow this seems like implementing RAID within postgres, which seems
a bit outside of the scope of a DB.

> Use Case A: better read performance while keeping data write reliability
>
> The first application of this multiple volumes per tablespace idea is to keep
> a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW).

For example, I don't believe there is a restriction against having one
member of a RAID array being a RAM disk.

> Use Case B: Synchronous Master Slave(s) Replication
>
> By using a Distributed File System capable of being mounted from several nodes
> at the same time, we could have a configuration where a master node has
> ('exports') a WO tablespace volume, and one or more slaves (depending on FS
> capability) configures a RO tablespace volume.

Here you have the problem of row visibility. The data in the table isn't
very useful without the clog, and that's not stored in a tablespace...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Multiple Storage per Tablespace, or Volumes

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Somehow this seems like implementing RAID within postgres,

RAID and LVM too.  I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
        regards, tom lane


Re: Multiple Storage per Tablespace, or Volumes

From
Dimitri Fontaine
Date:
On Monday 19 February 2007 16:33, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Somehow this seems like implementing RAID within postgres,
>
> RAID and LVM too.  I can't get excited about re-inventing those wheels
> when perfectly good implementations already exist for us to sit on top of.

I thought moving some knowledge about data availability into PostgreSQL code
could provide some valuable performance benefit, allowing it to organize reads
(for example, parallel table/index scans against different volumes) and
to obtain data from the volume known to be quicker (or least used/loaded).

You're both saying that RAID/LVM implementations provide good enough performance
that PG doesn't have to go this way, if I understand correctly.
And distributed file systems are enough to cover the replication side, without
PG having to deal explicitly with the work involved.

Maybe I should have slept after all ;)

Thanks for your time and comments, regards,
--
Dimitri Fontaine


Re: Multiple Storage per Tablespace, or Volumes

From
Tom Lane
Date:
Dimitri Fontaine <dim@dalibo.com> writes:
> You're both saying that RAID/LVM implementations provide good enough performance 
> that PG doesn't have to go this way, if I understand correctly.

There's certainly no evidence to suggest that reimplementing them
ourselves would be a productive use of our time.
        regards, tom lane


Re: Multiple Storage per Tablespace, or Volumes

From
Martijn van Oosterhout
Date:
On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote:
> > RAID and LVM too.  I can't get excited about re-inventing those wheels
> > when perfectly good implementations already exist for us to sit on top of.
>
> I thought moving some knowledge about data availability into PostgreSQL code
> could provide some valuable performance benefit, allowing it to organize reads
> (for example, parallel table/index scans against different volumes) and
> to obtain data from the volume known to be quicker (or least used/loaded).

Well, organising requests to be handled quickly is a function of
LVM/RAID, so we don't go there. However, speeding up scans by having
multiple requests is an interesting approach, as would perhaps a
different random_page_cost for different tablespaces.

My point is, don't try to implement the mechanics of LVM/RAID into
postgres, instead, work on providing ways for users to take advantage
of these mechanisms if they have them. Look at it this way: given that
you have LVM/RAID set up for your ideas, how do you get postgres to take
advantage of them?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Multiple Storage per Tablespace, or Volumes

From
Peter Eisentraut
Date:
Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Somehow this seems like implementing RAID within postgres,
>
> RAID and LVM too.  I can't get excited about re-inventing those
> wheels when perfectly good implementations already exist for us to
> sit on top of.

I expect that someone will point out that Windows doesn't support RAID 
or LVM, and we'll have to reimplement it anyway.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Multiple Storage per Tablespace, or Volumes

From
Magnus Hagander
Date:
Peter Eisentraut wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too.  I can't get excited about re-inventing those
>> wheels when perfectly good implementations already exist for us to
>> sit on top of.
> 
> I expect that someone will point out that Windows doesn't support RAID 
> or LVM, and we'll have to reimplement it anyway. 

Windows supports both RAID and LVM.

//Magnus



Re: Multiple Storage per Tablespace, or Volumes

From
Stefan Kaltenbrunner
Date:
Peter Eisentraut wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too.  I can't get excited about re-inventing those
>> wheels when perfectly good implementations already exist for us to
>> sit on top of.
> 
> I expect that someone will point out that Windows doesn't support RAID 
> or LVM, and we'll have to reimplement it anyway.

Windows has supported software RAID just fine since Windows 2000 or so ...


Stefan


Re: Multiple Storage per Tablespace, or Volumes

From
Peter Eisentraut
Date:
Magnus Hagander wrote:
> Windows supports both RAID and LVM.

Oh good, so we've got that on record. :)

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Multiple Storage per Tablespace, or Volumes

From
"Joshua D. Drake"
Date:
Stefan Kaltenbrunner wrote:
> Peter Eisentraut wrote:
>> Tom Lane wrote:
>>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>>> Somehow this seems like implementing RAID within postgres,
>>> RAID and LVM too.  I can't get excited about re-inventing those
>>> wheels when perfectly good implementations already exist for us to
>>> sit on top of.
>> I expect that someone will point out that Windows doesn't support RAID 
>> or LVM, and we'll have to reimplement it anyway.
> 
> Windows has supported software RAID just fine since Windows 2000 or so ...

Longer than that... it supported mirroring and raid 5 in NT4 and
possibly even NT3.51 IIRC.


Joshua D. Drake




-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Multiple Storage per Tablespace, or Volumes

From
Andrew Sullivan
Date:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Somehow this seems like implementing RAID within postgres,
> 
> RAID and LVM too.  I can't get excited about re-inventing those wheels
> when perfectly good implementations already exist for us to sit on top of.

Ok, warning, this is a "you know what would be sweet" moment.

What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.

The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later.  So keep it all.  I'll get back to you when
I want to know something."

It'd be nice to be able to load up all that data once, and then shunt
it off into (say) read-only media.  If one could then run a query
that would tell one which spans of data are candidates for the
search, you could bring back online (onto reasonably fast storage,
for instance) just the volumes you need to read.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard 
to use, but at least there is a lot of code underneath."    --Damien Katz


Re: Multiple Storage per Tablespace, or Volumes

From
"Joshua D. Drake"
Date:
Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> Somehow this seems like implementing RAID within postgres,
>> RAID and LVM too.  I can't get excited about re-inventing those wheels
>> when perfectly good implementations already exist for us to sit on top of.
> 
> Ok, warning, this is a "you know what would be sweet" moment.

The dreaded words from a developer's mouth to every manager in the world.

Joshua D. Drake



-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: Multiple Storage per Tablespace, or Volumes

From
Greg Smith
Date:
On Mon, 19 Feb 2007, Joshua D. Drake wrote:

> Longer than that... it supported mirroring and raid 5 in NT4 and
> possibly even NT3.51 IIRC.

Mirroring and RAID 5 go back to Windows NT 3.1 Advanced Server in 1993:

http://support.microsoft.com/kb/114779
http://www.byte.com/art/9404/sec8/art7.htm

The main source of confusion about current support for this feature is 
that the desktop/workstation versions of Windows don't have it.  For 
Windows XP, you need the XP Professional version to get "dynamic disk" 
support; it's not in the home edition.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Multiple Storage per Tablespace, or Volumes

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> Andrew Sullivan wrote:
> > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> >> Martijn van Oosterhout <kleptog@svana.org> writes:
> >>> Somehow this seems like implementing RAID within postgres,
> >> RAID and LVM too.  I can't get excited about re-inventing those wheels
> >> when perfectly good implementations already exist for us to sit on top of.
> > 
> > Ok, warning, this is a "you know what would be sweet" moment.
> 
> The dreaded words from a developer's mouth to every manager in the world.

Yea, I just instinctively hit "delete" when I saw that phrase.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Multiple Storage per Tablespace, or Volumes

From
Alvaro Herrera
Date:
Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > Martijn van Oosterhout <kleptog@svana.org> writes:
> > > Somehow this seems like implementing RAID within postgres,
> > 
> > RAID and LVM too.  I can't get excited about re-inventing those wheels
> > when perfectly good implementations already exist for us to sit on top of.
> 
> Ok, warning, this is a "you know what would be sweet" moment.
> 
> What would be nice is to be able to detach one of the volumes, and
> know the span of the data in there without being able to access the
> data.
> 
> The problem that a lot of warehouse operators have is something like
> this: "We know we have all this data, but we don't know what we will
> want to do with it later.  So keep it all.  I'll get back to you when
> I want to know something."

You should be able to do that with tablespaces and VACUUM FREEZE, the
point of the latter being that you can take the disk containing the
"read only" data offline, and still have the data readable after
plugging it back in, no matter how far along the transaction ID counter
is.
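
Roughly, with made-up table and tablespace names (only standard commands,
nothing new is needed):

  CREATE TABLESPACE archive2005 LOCATION '/mnt/removable/pg_archive';
  ALTER TABLE sales_2005 SET TABLESPACE archive2005;
  VACUUM FREEZE sales_2005;
  -- the filesystem behind 'archive2005' can now be taken offline and
  -- remounted later when somebody wants to query the old data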

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Multiple Storage per Tablespace, or Volumes

From
"Simon Riggs"
Date:
On Mon, 2007-02-19 at 17:35 -0300, Alvaro Herrera wrote:
> Andrew Sullivan wrote:
> > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > > Martijn van Oosterhout <kleptog@svana.org> writes:
> > > > Somehow this seems like implementing RAID within postgres,
> > > 
> > > RAID and LVM too.  I can't get excited about re-inventing those wheels
> > > when perfectly good implementations already exist for us to sit on top of.
> > 
> > Ok, warning, this is a "you know what would be sweet" moment.
> > 
> > What would be nice is to be able to detach one of the volumes, and
> > know the span of the data in there without being able to access the
> > data.
> > 
> > The problem that a lot of warehouse operators have is something like
> > this: "We know we have all this data, but we don't know what we will
> > want to do with it later.  So keep it all.  I'll get back to you when
> > I want to know something."
> 
> You should be able to do that with tablespaces and VACUUM FREEZE, the
> point of the latter being that you can take the disk containing the
> "read only" data offline, and still have the data readable after
> plugging it back in, no matter how far along the transaction ID counter
> is.

Doesn't work anymore because VACUUM FREEZE doesn't (and can't) take a
full table lock, so somebody can always update data after a data block
has been frozen. That can lead to putting a table onto read-only media
when it still needs vacuuming, which is a great way to break the DB. It
also doesn't freeze still visible data, so there's no easy way of doing
this. Waiting until the VF is the oldest Xid is prone to deadlock as
well.

Ideally, we'd have a copy to read-only media whilst freezing, as an
atomic operation, with some guarantees that it will actually have frozen
*everything*, or fail: 
ALTER TABLE SET TABLESPACE foo READONLY;

Can we agree on that as a TODO item?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Multiple Storage per Tablespace, or Volumes

From
David Fetter
Date:
On Mon, Feb 19, 2007 at 02:50:34PM -0500, Andrew Sullivan wrote:
> On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > Martijn van Oosterhout <kleptog@svana.org> writes:
> > > Somehow this seems like implementing RAID within postgres,
> > 
> > RAID and LVM too.  I can't get excited about re-inventing those
> > wheels when perfectly good implementations already exist for us to
> > sit on top of.
> 
> Ok, warning, this is a "you know what would be sweet" moment.
> 
> What would be nice is to be able to detach one of the volumes, and
> know the span of the data in there without being able to access the
> data.
> 
> The problem that a lot of warehouse operators have is something like
> this: "We know we have all this data, but we don't know what we will
> want to do with it later.  So keep it all.  I'll get back to you
> when I want to know something."
> 
> It'd be nice to be able to load up all that data once, and then
> shunt it off into (say) read-only media.  If one could then run a
> query that would tell one which spans of data are candidates for the
> search, you could bring back online (onto reasonably fast storage,
> for instance) just the volumes you need to read.

Isn't this one of the big use cases for table partitioning?
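
I mean the usual 8.2-style inheritance setup, sketched here with made-up
names:

  CREATE TABLE finance (entry_date date, amount numeric);
  CREATE TABLE finance_1997
      (CHECK (entry_date >= '1997-01-01' AND entry_date < '1998-01-01'))
      INHERITS (finance) TABLESPACE archive_1997;
  -- with constraint_exclusion = on, queries whose WHERE clause rules out
  -- 1997 never touch finance_1997, though its tablespace still has to be
  -- mounted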

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
I have a WIP patch that adds the main detail I have found I need to 
properly tune checkpoint and background writer activity.  I think it's 
almost ready to submit (you can see the current patch against 8.2 at 
http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
after making it a bit more human-readable.  But I've realized that along 
with that, I need some guidance in regards to what log level is 
appropriate for this information.

An example works better than explaining what the patch does:

2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote 
checkpoint_segments)
2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034 
write=279956 us sync=17739974 us

Remember that "Load distributed checkpoint" discussion back in December? 
You can see exactly how bad the problem is on your system with this log 
style (this is from a pgbench run where it's positively awful--it really 
does take over 17 seconds for the fsync to execute, and there are clients 
that are hung the whole time waiting for it).

I also instrumented the background writer.  You get messages like this:

2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5 
Unscanned = 23/54

This shows that we wrote (5) the maximum pages we were allowed to write 
(5) while failing to scan almost half (23) of the buffers we meant to look 
at (54).  By taking a look at this output while the system is under load, 
I found I was able to do bgwriter optimization that used to take me days 
of frustrating testing in hours.  I've been waiting for a good guide to 
bgwriter tuning since 8.1 came out.  Once you have this, combined with 
knowing how many buffers were dirty at checkpoint time because the 
bgwriter didn't get to them in time (the number you want to minimize), the 
tuning guide practically writes itself.

So my question is...what log level should all this go at?  Right now, I 
have the background writer stuff adjusting its level dynamically based on 
what happened; it logs at DEBUG2 if it hits the write limit (which should 
be a rare event once you're tuned properly), DEBUG3 for writes that 
scanned everything they were supposed to, and DEBUG4 if it scanned but 
didn't find anything to write.  The source of checkpoint information logs 
at DEBUG1, the fsync/write info at DEBUG2.

I'd like to move some of these up.  On my system, I even have many of the 
checkpoint messages logged at INFO (the source of the checkpoint and the 
total write time line).  It's a bit chatty, but when you get some weird 
system pause issue it makes it easy to figure out if checkpoints were to 
blame.  Given how useful I feel some of these messages are to system 
tuning, and to explaining what currently appears as inexplicable pauses, I 
don't want them to be buried at DEBUG levels where people are unlikely to 
ever see them (I think some people may be concerned about turning on 
things labeled DEBUG at all).  I am aware that I am too deep into this to 
have an unbiased opinion at this point though, which is why I ask for 
feedback on how to proceed here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Multiple Storage per Tablespace, or Volumes

From
Robert Treat
Date:
On Monday 19 February 2007 15:08, Bruce Momjian wrote:
> Joshua D. Drake wrote:
> > Andrew Sullivan wrote:
> > > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
> > >> Martijn van Oosterhout <kleptog@svana.org> writes:
> > >>> Somehow this seems like implementing RAID within postgres,
> > >>
> > >> RAID and LVM too.  I can't get excited about re-inventing those wheels
> > >> when perfectly good implementations already exist for us to sit on top
> > >> of.
> > >
> > > Ok, warning, this is a "you know what would be sweet" moment.
> >
> > The dreaded words from a developer's mouth to every manager in the world.
>
> Yea, I just instinctively hit "delete" when I saw that phrase.

Too bad... I know Oracle can do what he wants... possibly other DB systems as 
well. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Multiple Storage per Tablespace, or Volumes

From
Robert Treat
Date:
On Monday 19 February 2007 11:27, Martijn van Oosterhout wrote:
> On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote:
> > > RAID and LVM too.  I can't get excited about re-inventing those wheels
> > > when perfectly good implementations already exist for us to sit on top
> > > of.
> >
> > I thought moving some knowledge about data availability into PostgreSQL
> > code could provide some valuable performance benefit, allowing it to
> > organize reads (for example, parallel table/index scans against
> > different volumes) and to obtain data from the volume known to be
> > quicker (or least used/loaded).
>
> Well, organising requests to be handled quickly is a function of
> LVM/RAID, so we don't go there. However, speeding up scans by having
> multiple requests is an interesting approach, as would perhaps a
> different random_page_cost for different tablespaces.
>

On one of my systems I have 1 tablespace for read data (99-1), 1 for 
read-mostly data (90-10), and 1 for write-mostly data (40-60).  The breakdown 
is based on a combination of the underlying hardware and the usage patterns 
of the tables involved. I suspect that isn't that uncommon really.  I've 
often thought that being able to set GUC variables on a specific tablespace 
(like you can for users) would allow for a lot of flexibility in tuning 
queries that go across different tablespaces.
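
The per-role form of that already exists today, e.g.:

  ALTER ROLE reporting_user SET random_page_cost = 2.0;

A per-tablespace analogue might look something like this (hypothetical
syntax, nothing of the sort exists at the moment):

  ALTER TABLESPACE read_mostly SET random_page_cost = 2.0;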

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Multiple Storage per Tablespace, or Volumes

From
Andrew Sullivan
Date:
On Mon, Feb 19, 2007 at 07:10:52PM -0800, David Fetter wrote:
> 
> Isn't this one of the big use cases for table partitioning?

Sure, but you can't detach that data in the meantime, AFAIK.  Maybe
I've missed something.

If I have 10 years of finance data, and I have to keep it all online
all the time, my electricity costs alone are outrageous.  If I can
know _where_ the data is without being able to get it until I've
actually made it available, then my query tools could be smart enough
to detect "table partitioned; this data is offline", I could roll
back to my savepoint, return a partial result set to the user, and
tell it "call back in 24 hours for your full report".  

Yes, I know, hands waving in the air.  But I already said I was
having a "you know what would be sweet" moment.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.     --Dennis Ritchie


Re: Log levels for checkpoint/bgwriter monitoring

From
Robert Treat
Date:
On Monday 19 February 2007 22:59, Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to
> properly tune checkpoint and background writer activity.  I think it's
> almost ready to submit (you can see the current patch against 8.2 at
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
> after making it a bit more human-readable.  But I've realized that along
> with that, I need some guidance in regards to what log level is
> appropriate for this information.
>
> An example works better than explaining what the patch does:
>
> 2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote
> checkpoint_segments)
> 2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
> 2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
> 2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
> 2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
> 2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034
> write=279956 us sync=17739974 us
>
> Remember that "Load distributed checkpoint" discussion back in December?
> You can see exactly how bad the problem is on your system with this log
> style (this is from a pgbench run where it's positively awful--it really
> does take over 17 seconds for the fsync to execute, and there are clients
> that are hung the whole time waiting for it).
>
> I also instrumented the background writer.  You get messages like this:
>
> 2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5
> Unscanned = 23/54
>
> This shows that we wrote (5) the maximum pages we were allowed to write
> (5) while failing to scan almost half (23) of the buffers we meant to look
> at (54).  By taking a look at this output while the system is under load,
> I found I was able to do bgwriter optimization that used to take me days
> of frustrating testing in hours.  I've been waiting for a good guide to
> bgwriter tuning since 8.1 came out.  Once you have this, combined with
> knowing how many buffers were dirty at checkpoint time because the
> bgwriter didn't get to them in time (the number you want to minimize), the
> tuning guide practically writes itself.
>
> So my question is...what log level should all this go at?  Right now, I
> have the background writer stuff adjusting its level dynamically based on
> what happened; it logs at DEBUG2 if it hits the write limit (which should
> be a rare event once you're tuned properly), DEBUG3 for writes that
> scanned everything they were supposed to, and DEBUG4 if it scanned but
> didn't find anything to write.  The source of checkpoint information logs
> at DEBUG1, the fsync/write info at DEBUG2.
>
> I'd like to move some of these up.  On my system, I even have many of the
> checkpoint messages logged at INFO (the source of the checkpoint and the
> total write time line).  It's a bit chatty, but when you get some weird
> system pause issue it makes it easy to figure out if checkpoints were to
> blame.  Given how useful I feel some of these messages are to system
> tuning, and to explaining what currently appears as inexplicable pauses, I
> don't want them to be buried at DEBUG levels where people are unlikely to
> ever see them (I think some people may be concerned about turning on
> things labeled DEBUG at all).  I am aware that I am too deep into this to
> have an unbiased opinion at this point though, which is why I ask for
> feedback on how to proceed here.
>

My impression of this is that DBAs would typically want to run this for a 
short period of time to get their systems tuned and then it pretty much 
becomes chatter.  Can you come up with an idea of what information DBAs need 
to know?  Would it be better to hide all of this logging behind a GUC that 
can be turned on or off (log_bgwriter_activity)? Maybe you could just 
reinstrument it as dtrace probes so a separate stand-alone process could pull 
the information as needed? :-)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Log levels for checkpoint/bgwriter monitoring

From
"Jim C. Nasby"
Date:
On Mon, Feb 19, 2007 at 10:59:38PM -0500, Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to 
> properly tune checkpoint and background writer activity.  I think it's 
> almost ready to submit (you can see the current patch against 8.2 at 
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
> after making it a bit more human-readable.  But I've realized that along 
> with that, I need some guidance in regards to what log level is 
> appropriate for this information.

It would also be extremely useful to make checkpoint stats visible
somewhere in the database (presumably via the existing stats mechanism).
The log output is great for doing initial tuning, but you'd also want to
be able to monitor things over time, which would be impractical via logging.
I'm thinking just tracking how many pages had to be flushed during a
checkpoint would be a good start.
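
Something along these lines, say -- view and column names entirely
hypothetical:

  SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
    FROM pg_stat_bgwriter;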
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Thu, 22 Feb 2007, Jim C. Nasby wrote:

> It would also be extremely useful to make checkpoint stats visible 
> somewhere in the database (presumably via the existing stats 
> mechanism)... I'm thinking just tracking how many pages had to be 
> flushed during a checkpoint would be a good start.

I'm in the middle of testing an updated version of the patch; once I nail 
down exactly what needs to be logged I'd planned to open a discussion on 
which of those things would be best served by pg_stats instead of a log.

I decided specifically to aim for the logs instead for the checkpoint data 
because if you're in a situation where you are inserting fast enough that the 
checkpoints are spaced closely together, you'd end up having to poll 
pg_stats all the time to make sure you catch them all, which becomes even 
less efficient than just logging the data.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Wed, 21 Feb 2007, Robert Treat wrote:

> My impression of this is that DBAs would typically want to run this for a
> short period of time to get their systems tuned and then it pretty much
> becomes chatter.  Can you come up with an idea of what information DBAs need
> to know?

I am structuring the log levels so that you can see all the events that are 
likely sources of extreme latency by using log level DEBUG1.  That level 
isn't too chatty and I've decided I can leave the server there forever if 
need be.  I've actually come up with a much better interface to the 
background writer logging than the one I originally outlined, which I'll 
have ready for feedback in a few more days.  Now it dumps a nice summary 
report every time it completes a scan, rather than logging lower-level 
info.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Jim Nasby
Date:
On Mar 5, 2007, at 8:34 PM, Greg Smith wrote:
> On Thu, 22 Feb 2007, Jim C. Nasby wrote:
>
>> It would also be extremely useful to make checkpoint stats visible  
>> somewhere in the database (presumably via the existing stats  
>> mechanism)... I'm thinking just tracking how many pages had to be  
>> flushed during a checkpoint would be a good start.
>
> I'm in the middle of testing an updated version of the patch, once  
> I nail down exactly what needs to be logged I'd planned to open a  
> discussion on which of those things would be best served by  
> pg_stats instead of a log.
>
> I decided specifically to aim for the logs instead for the  
> checkpoint data because if you're in a situation where you are  
> inserting fast enough that the checkpoints are spaced closely  
> together, you'd end up having to poll pg_stats all the time to  
> make sure you catch them all, which becomes even less efficient  
> than just logging the data.

It's always good to be able to log stuff for detailed  
troubleshooting, so that's a good place to start. The flipside is  
that it's much easier to machine-parse a table than to try to  
scrape the logs. And I don't think we'll generally care about each  
individual checkpoint; rather we'll want to look at things like  
'checkpoints/hour' and 'checkpoint written pages/hour'.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Tue, 6 Mar 2007, Jim Nasby wrote:

> The flipside is that it's much easier to machine-parse a table rather 
> than trying to scrape the logs.

Now you might realize why I've been so vocal on the SQL log export 
implementation details.

> And I don't think we'll generally care about each individual checkpoint; 
> rather we'll want to look at things like 'checkpoints/hour' and 
> 'checkpoint written pages/hour'.

After a few months of staring at this data, I've found averages like that 
misleading.  The real problem areas correlate with the peak pages written 
at any one checkpoint.  Lowering that value is really the end-game for 
optimizing the background writer, and the peaks are what will nail you 
with a nasty fsync pause at checkpoint time.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> After a few months of staring at this data, I've found averages like that 
> misleading.  The real problem areas correlate with the peak pages written 
> at any one checkpoint.  Lowering that value is really the end-game for 
> optimizing the background writer, and the peaks are what will nail you 
> with a nasty fsync pause at checkpoint time.

If you already have the technical knowledge to derive the best settings from
activity logs, could you write it down in the code? I'm hoping for some kind of
automatic control feature in the bgwriter if its best configuration varies by
usage or activity.


BTW, I'm planning two changes in bgwriter.
 [Load distributed checkpoint]
   http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php
 [Automatic adjustment of bgwriter_lru_maxpages]
   http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php

I have some results showing that if we have plenty of time for checkpoints,
bgwriter_all_maxpages is not such an important parameter, because it is
adjusted to "shared_buffers / duration of checkpoint".
Also, my recommended bgwriter_lru_maxpages is the "average number of
recycled buffers per cycle", which is hardly something you can tune manually.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Log levels for checkpoint/bgwriter monitoring

From
Jim Nasby
Date:
On Mar 6, 2007, at 10:11 PM, ITAGAKI Takahiro wrote:
> I have some results that if we have plenty of time for checkpoints,
> bgwriter_all_maxpages is not a so important parameter because it is
> adjusted to "shared_buffers / duration of checkpoint".
> Also, my recommended bgwriter_lru_maxpages is "average number of
> recycled buffers per cycle", that is hardly able to tune manually.

What do you mean by "number of recycled buffers per cycle"?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Wed, 7 Mar 2007, ITAGAKI Takahiro wrote:

> Also, my recommended bgwriter_lru_maxpages is "average number of
> recycled buffers per cycle", that is hardly able to tune manually.

This is completely dependent on what percentage of your buffer cache is 
pinned.  If your load is something like the standard pgbench, the LRU 
writer will rarely find anything useful to write, so this entire line of 
thinking won't work.  The proper behavior for heavily pinned data is to 
turn off the LRU writer altogether so there's more time to run the all 
scan.

The job I'm trying to take on here is not to presume I can solve these 
problems myself yet.  I've instead recognized that people need usefully 
organized information in order to even move in that direction, and that 
information is not even close to being available right now.

What my latest work in progress patches do is summarize each scan of the 
buffer pool with information about how much was written by each of the two 
writers, along with noting what percentage of the pool was pinned data. 
I'm trying to get that one ready to submit this week.  Those three values 
suggest some powerful techniques for tuning, but it's not quite good 
enough to allow auto-tuning.  It also needs a feel for how much time is 
left before the next checkpoint.

What really needs to go along with all this is a sort of progress bar that 
estimates how long we are from a checkpoint based on both a) the timeout, 
and b) how many segments have been written.  The timeout one is easy to 
work with that way (from what I read of your code, you've worked that 
angle).  The part I had trouble doing was getting the WAL writers to 
communicate a progress report on how many segments they filled back to the 
bgwriter.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
ITAGAKI Takahiro
Date:
Jim Nasby <decibel@decibel.org> wrote:

> > Also, my recommended bgwriter_lru_maxpages is "average number of
> > recycled buffers per cycle", that is hardly able to tune manually.
> 
> What do you mean by 'number of recycled buffers per cycle"?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
| To reduce the probability that server processes will need to issue their
| own writes, the background writer tries to write buffers that are likely
| to be recycled soon.
| * bgwriter_lru_maxpages (integer)
| In each round, no more than this many buffers will be written as a
| result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of
StrategyGetBuffer() calls per round. I think the number is suitable for
bgwriter_lru_maxpages if we want the bgwriter to write almost all of the dirty pages.

I proposed to change the semantics of bgwriter_lru_maxpages. It represented
"maximum writes per round", but the new meaning is "reserved buffers for
recycle". Non-dirty unused buffers will be counted among it.


I'm measuring the difference in performance between manual and automatic
tuning, especially when adding some jobs before writes. I'll let you know
when I get the results.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Log levels for checkpoint/bgwriter monitoring

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> > Also, my recommended bgwriter_lru_maxpages is "average number of
> > recycled buffers per cycle", that is hardly able to tune manually.
> 
> This is completely dependent on what percentage of your buffer cache is 
> pinned.

Don't you mean usage_count? In my understanding, each backend pins two
or so buffers at once. So percentage of pinned buffers should be low.

> If your load is something like the standard pgbench, the LRU 
> writer will rarely find anything useful to write, so this entire line of 
> thinking won't work.  The proper behavior for heavily pinned data is to 
> turn off the LRU writer altogether so there's more time to run the all 
> scan.

I think you are pointing out a problem with the buffer management algorithm
itself, not only the bgwriter. Even if you turn off the LRU writer, each
backend pays the same cost to find recyclable buffers every time it
allocates a buffer.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

> In my understanding, each backend pins two or so buffers at once. So 
> percentage of pinned buffers should be low.

With the pgbench workload, a substantial percentage of the buffer cache 
ends up pinned.  From staring at the buffer cache using 
contrib/pg_buffercache, I believe most of that consists of the index 
blocks for the records being updated in the accounts table.
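
For anyone who wants to look at the same thing, a rough query against
contrib/pg_buffercache is enough -- the join on relfilenode glosses over
other databases and shared catalogs, so treat it as a sketch:

   SELECT c.relname, count(*) AS dirty_buffers
     FROM pg_buffercache b
     JOIN pg_class c ON b.relfilenode = c.relfilenode
    WHERE b.isdirty
    GROUP BY c.relname
    ORDER BY dirty_buffers DESC
    LIMIT 10;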

I just posted a new version of the patch I asked for feedback on at the 
beginning of this thread, the latest one is at 
http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
I've been adjusting it to monitor the same data I think you need to refine 
your patch.  I believe the approach you're taking includes some 
assumptions that seem perfectly reasonable, but that my testing doesn't 
agree with.  There's nothing like measuring something to settle what's 
really going on, though, so that's what I've been focusing on.  I'd love 
to get some feedback on whether other people can replicate the kind of 
things I'm seeing.

The new code generates statistics about exactly what the background writer 
scan found during each round.  If there's been substantial write activity, 
it prints a log line when it recycles back to the beginning of the all 
scan, to help characterize what the buffer pool looked like during that 
scan from the perspective of the bgwriter.  Here's some sample log output 
from my underpowered laptop while running pgbench:

bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache 
was both pinned and dirty.  During that same time period, the LRU write 
also wrote out a fair amount of data, operating on the 20% of the cache 
that was dirty but not pinned.  On my production server, where the 
background writer is turned way up to reduce checkpoint times, these 
numbers are even more extreme; almost everything that's dirty is also 
pinned during pgbench, and the LRU is lucky to find anything it can write 
as a result.

That patch is against the 8.2 codebase; now that I'm almost done I'm 
planning to move it to HEAD instead soon (where it will conflict 
considerably with your patch).  If you have an 8.2 configuration you can 
test with my patch applied, set log_min_messages = debug2, try it out, and 
see what you get when running pgbench for a while.  I think you'll 
discover some interesting and unexpected things.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> With the pgbench workload, a substantial percentage of the buffer cache 
> ends up pinned.

[ raised eyebrow... ]  Prove that.  AFAIK it's impossible for the
pgbench queries to pin more than about three or four buffers per backend
concurrently.
        regards, tom lane


Re: Log levels for checkpoint/bgwriter monitoring

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> > In my understanding, each backend pins two or so buffers at once. So 
> > percentage of pinned buffers should be low.
> 
> With the pgbench workload, a substantial percentage of the buffer cache 
> ends up pinned.

> http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
> bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
> ...
> checkpoint required (wrote checkpoint_segments)
> checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
> 
> Here 69% of the buffer cache contained dirty data, and 49% of the cache 
> was both pinned and dirty.

No. "Pinned" means bufHdr->refcount > 0 and you don't distinguish pinned or
recently-used (bufHdr->usage_count > 0) buffers in your patch.

!     if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
!     {
!         if (skip_pinned)
!         {
!             UnlockBufHdr(bufHdr);
!             return BUF_PINNED;
!         }
!         buffer_write_type = BUF_WRITTEN_AND_PINNED;


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

> "Pinned" means bufHdr->refcount > 0 and you don't distinguish pinned or 
> recently-used (bufHdr->usage_count > 0) buffers in your patch.

Thank you, I will revise the terminology used accordingly.  I was using 
"pinned" as a shortcut for "will be ignored by skip_pinned" which was 
sloppy of me.  As I said, I was trying to show how the buffer cache looks 
from the perspective of the background writer, and therefore lumping them 
together because that's how SyncOneBuffer views them.  A buffer cache full 
of either type will be largely ignored by the LRU writer, and that's what 
I've been finding when running insert/update heavy workloads like pgbench.

If I might suggest a terminology change to avoid this confusion in the 
future, I'd like to rename the SyncOneBuffer "skip_pinned" parameter to 
something like "skip_active", which is closer to the real behavior.  I 
know Oracle refers to these as "hot" and "cold" LRU entries.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Jim Nasby
Date:
On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
> almost everything that's dirty is also pinned during pgbench, and  
> the LRU is lucky to find anything it can write as a result

I'm wondering if pg_bench is a good test of this stuff. ISTM it's  
unrealistically write-heavy, which is going to tend to not only put a  
lot of dirty buffers into the pool, but also keep them pinned enough  
that you can't write them.

Perhaps you should either modify pg_bench to do a lot more selects  
out of the various tables or look towards a different benchmark.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Log levels for checkpoint/bgwriter monitoring

From
Jim Nasby
Date:
On Mar 9, 2007, at 7:57 AM, Greg Smith wrote:
> On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:
>
>> "Pinned" means bufHdr->refcount > 0 and you don't distinguish  
> > pinned from recently-used (bufHdr->usage_count > 0) buffers in your  
>> patch.
>
> Thank you, I will revise the terminology used accordingly.  I was  
> using "pinned" as a shortcut for "will be ignored by skip_pinned"  
> which was sloppy of me.  As I said, I was trying to show how the  
> buffer cache looks from the perspective of the background writer,  
> and therefore lumping them together because that's how  
> SyncOneBuffer views them.  A buffer cache full of either type will  
> be largely ignored by the LRU writer, and that's what I've been  
> finding when running insert/update heavy workloads like pgbench.
>
> If I might suggest a terminology change to avoid this confusion in  
> the future, I'd like to rename the SyncOneBuffer "skip_pinned"  
> parameter to something like "skip_active", which is closer to the  
> real behavior.  I know Oracle refers to these as "hot" and "cold"  
> LRU entries.

Well, AIUI, whether the buffer is actually pinned or not is almost  
inconsequential (other than if a buffer *is* pinned then its usage  
count is about to become > 0, so there's no reason to consider  
writing it).

What that parameter really does is control whether you're going to  
follow the LRU semantics or not...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Log levels for checkpoint/bgwriter monitoring

From
Tom Lane
Date:
Jim Nasby <decibel@decibel.org> writes:
> On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
>> almost everything that's dirty is also pinned during pgbench, and  
>> the LRU is lucky to find anything it can write as a result

> I'm wondering if pg_bench is a good test of this stuff.

On reflection I think that Greg's result is probably unsurprising, and
furthermore does not indicate that anything is wrong.

What it shows (now that we got past the terminology) is that only about
half of the buffer pool is subject to replacement during any given clock
sweep.  For low-usage pages that's about what you'd expect: a page is
sucked in on demand (using a buffer returned by the clock sweep), and
when we're done with it it'll have usage_count = 1.  If it's not touched
again then when the clock sweep returns to it it'll be decremented to
usage_count 0, and on the next visit it'll be recycled for use as
something else.  Thus for low-usage pages you'd fully expect that about
half of the buffer population has usage_count 1 and the rest has usage
count 0; which is strikingly close to Greg's measurement that 48.8%
of the population has usage_count 0.

What this seems to tell us is that pgbench's footprint of heavily used
pages (those able to achieve usage_counts above 1) is very small.  Which
is probably right unless you've used a very large scale factor.  I'd be
interested to know what scale factor and shared_buffers setting led to
the above measurement.

It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.
        regards, tom lane


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Fri, 9 Mar 2007, Jim Nasby wrote:

> I'm wondering if pg_bench is a good test of this stuff. ISTM it's 
> unrealistically write-heavy, which is going to tend to not only put a 
> lot of dirty buffers into the pool, but also keep them pinned enough 
> that you can't write them.

Whether it's "unrealistically" write-heavy kind of depends on what your 
real app is.  The standard pgbench is a bit weird because it does so many 
updates to tiny tables, which adds a level of locking contention that 
doesn't really reflect many real-world situations.  But the no-branch mode 
(update/select to accounts, insert into history) isn't too dissimilar from 
some insert-heavy logging applications I've seen.

The main reason I brought this all up was because Itagaki seemed to be 
using pgbench for some of his performance tests.  I just wanted to point 
out that the LRU background writer specifically tends to be very 
underutilized when using pgbench.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Fri, 9 Mar 2007, Tom Lane wrote:

> I'd be interested to know what scale factor and shared_buffers setting 
> led to the above measurement.

That was just a trivial example with 1 client, scale=10 (~160MB database), 
and shared_buffers=24MB.  Where things really get interesting with pgbench 
is on a system with enough horsepower+clients to dirty the whole buffer 
cache well before a checkpoint.  I regularly see >75% of the cache dirty 
and blocked from LRU writes with pgbench's slightly pathological workload 
in that situation.

You're correct that these results aren't particularly surprising or 
indicative of a problem to be corrected.  But they do shed some light on 
what pgbench is and isn't appropriate for testing.

> It strikes me that the patch would be more useful if it produced a
> histogram of the observed usage_counts, rather than merely the count
> for usage_count = 0.

I'll start working in that direction.  With the feedback everyone has 
given me on how few of the buffers are truly "pinned" via the correct 
usage of the term, I'm going to revisit the usage details and revise that 
section.  I'm happy with how I'm reporting the checkpoint details now, 
still some work left to do on the bgwriter activity.
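
As a stopgap, a rough version of that histogram can also be pulled from SQL,
assuming a pg_buffercache build that exposes a usagecount column (the stock
8.2 contrib module may not):

   SELECT usagecount, count(*) AS buffers
     FROM pg_buffercache
    WHERE isdirty
    GROUP BY usagecount
    ORDER BY usagecount;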

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Fri, 9 Mar 2007, Tom Lane wrote:

> It strikes me that the patch would be more useful if it produced a 
> histogram of the observed usage_counts

Don't have something worth releasing yet, but I did code a first rev of 
this today.  The results are quite instructive and well worth looking 
at.

The main server I work on has shared_buffers=60000 for pgbench testing and 
the background writers turned way up (in hopes of reducing checkpoint 
times; that's a whole nother topic).  I run pgbench with s=100 (~1.6GB 
database).  Here's what I get as statistics on the buffer pool after a 
scan when the server is "happy", from a run with 20 clients:

writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram:
0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Basically, everything that's left dirty is also heavily used; as I noted 
before, when I inspect with pg_buffercache these are mostly index blocks. 
I note that I should probably generate a second histogram that compares 
the clean data.  The all scan is writing pages out as soon as they're dirtied, 
so the LRU background writer is lucky if it can find anything to do.  (Note 
that I don't rely on the LRU writer more because that causes a significant 
lull in writes after a checkpoint.  By the time it gets going again it's 
harder to catch up, and delaying PostgreSQL writes also aggravates issues 
with the way Linux caches writes.)

What I found really interesting was comparing a "sad" section where 
performance breaks down.  This is from a minute later:

writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram:
0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage 
counts.  The breakdown seems to involve evicting the index blocks to make 
room for the recently dirty data when it can't be written out fast enough. 
As the index blocks go poof, things slow further, and into the vicious 
circle you go.  Eventually you can end up blocked on a combination of 
buffer evictions and disk seeks for uncached data that are fighting with 
the writes.
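
(For reference, the histograms above are nothing more than the dirty
buffers bucketed by usage_count.  Below is a minimal standalone sketch of
that bucketing; the BufferSample struct and every name in it are invented
for illustration, this is not the actual bufmgr code.)

#include <stdio.h>

#define NUM_BUCKETS 6        /* usage_count 0..4, plus a "5+" bucket */

/* Simplified stand-in for a shared buffer header (not the real BufferDesc). */
typedef struct
{
    int usage_count;
    int is_dirty;
} BufferSample;

/*
 * Bucket the dirty buffers by usage_count, clamping everything at 5 or
 * above into the last bucket, and print percentages of the dirty total.
 */
static void
report_dirty_histogram(const BufferSample *buf, int nbuffers)
{
    int counts[NUM_BUCKETS] = {0};
    int dirty = 0;

    for (int i = 0; i < nbuffers; i++)
    {
        int bucket;

        if (!buf[i].is_dirty)
            continue;
        bucket = buf[i].usage_count;
        if (bucket >= NUM_BUCKETS - 1)
            bucket = NUM_BUCKETS - 1;
        counts[bucket]++;
        dirty++;
    }

    printf("dirty buffer usage count histogram:\n");
    for (int b = 0; b < NUM_BUCKETS; b++)
        printf("%s%d%s=%.1f%%", b ? " " : "", b,
               (b == NUM_BUCKETS - 1) ? "+" : "",
               dirty ? 100.0 * counts[b] / dirty : 0.0);
    printf("\n");
}

int
main(void)
{
    BufferSample demo[] = {
        {0, 1}, {2, 1}, {5, 1}, {7, 1}, {3, 0}, {1, 1}
    };

    report_dirty_histogram(demo, sizeof(demo) / sizeof(demo[0]));
    return 0;
}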

The bgwriter change this suggested to me is defining a triage behavior 
where the all scan switches to acting like the LRU one:

-Each sample period, note what % of the usage_count=0 records are dirty 
-If that number is above a tunable threshold, switch to only writing 
usage_count=0 records until it isn't anymore.

On my system a first guess for that tunable would be 2-5%, based on what 
values I see on either side of the sad periods.  No doubt some systems 
would set that much higher; I haven't tested my system at home yet to get a 
guideline for a more typical PC.
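
(A sketch of that triage rule, reading the sampled metric as the "0"
bucket of the dirty-buffer histogram above, i.e. the fraction of dirty
buffers sitting at usage_count = 0.  Every name and the threshold value
here are invented for illustration; this is not actual bgwriter code.)

typedef enum
{
    ALL_SCAN_WRITE_EVERYTHING,   /* normal all-scan behaviour */
    ALL_SCAN_ZERO_USAGE_ONLY     /* triage: act like the LRU scan */
} AllScanMode;

/* Hypothetical tunable; the first guess above was somewhere in 0.02-0.05. */
static double bgwriter_triage_threshold = 0.03;

static AllScanMode
choose_all_scan_mode(int dirty_at_zero_usage, int dirty_total)
{
    double fraction;

    if (dirty_total == 0)
        return ALL_SCAN_WRITE_EVERYTHING;

    fraction = (double) dirty_at_zero_usage / dirty_total;

    /*
     * Too many reusable (usage_count == 0) buffers are dirty: stop writing
     * everything and concentrate on the buffers a backend would otherwise
     * have to write itself, until the backlog clears.
     */
    if (fraction > bgwriter_triage_threshold)
        return ALL_SCAN_ZERO_USAGE_ONLY;

    return ALL_SCAN_WRITE_EVERYTHING;
}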

As for why this behavior matters so much to me, I actually have a 
prototype auto-tuning background writer design that was hung up on this 
particular problem.  It notes how often you write out max_pages, uses that 
to model the average percentage of the buffer cache you're traversing each 
pass, then runs a couple of weighted-average feedback loops to aim for a 
target seconds/sweep.

The problem was that it went berserk when the whole buffer cache was dirty 
(I hope someone appreciates that I've been calling this "Felix Unger on 
crack mode" in my notes).  I needed some way to prioritize which buffers 
to concentrate on when that happens, and so far the above has been a good 
first-cut way to help with that.
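
(To make the feedback-loop idea above concrete, here is a rough standalone
model.  The names, the smoothing constant, and the structure are all
invented for illustration; this is not the actual prototype.)

#define SMOOTHING 0.2            /* weight given to the newest sample */

typedef struct
{
    double avg_pages_per_round;  /* smoothed pages written per round */
    double round_interval;       /* seconds between bgwriter rounds */
    double target_sweep_seconds; /* desired time to cover the pool */
    int    shared_buffers;       /* total buffers in the pool */
} SweepTuner;

static int
next_round_page_limit(SweepTuner *t, int pages_written_last_round)
{
    double rounds_per_sweep;
    double needed_per_round;

    /* Weighted-average feedback: old estimate dominates, new sample nudges. */
    t->avg_pages_per_round =
        (1.0 - SMOOTHING) * t->avg_pages_per_round +
        SMOOTHING * pages_written_last_round;

    /* How many rounds fit into the target sweep time? */
    rounds_per_sweep = t->target_sweep_seconds / t->round_interval;

    /* Pages to cover each round to finish a full sweep in time. */
    needed_per_round = (double) t->shared_buffers / rounds_per_sweep;

    /* Never aim below the smoothed observed write rate. */
    if (needed_per_round < t->avg_pages_per_round)
        needed_per_round = t->avg_pages_per_round;

    return (int) (needed_per_round + 0.5);
}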

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> Here's what I get as statistics on the buffer pool after a 
> scan when the server is "happy", from a run with 20 clients:

> writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
> dirty buffer usage count histogram:
> 0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Interesting, but not real meaningful by itself --- what's the situation
for non-dirty buffers?  And how many are there of each?

It might also be interesting to know exactly how many buffers were
pinned at the time the scan passed over them.  In theory it should be a
small fraction, but maybe it isn't ...
        regards, tom lane


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Mon, 12 Mar 2007, Tom Lane wrote:

> It might also be interesting to know exactly how many buffers were
> pinned at the time the scan passed over them.  In theory it should be a
> small fraction, but maybe it isn't ...

It is; the theory holds for all the tests I tried today.  The actual 
pinned buffers were so few (typically a fraction of the clients) that I 
reverted to just lumping them in with the recently used ones.  To better 
reflect the vast majority of what it's interacting with, in my patch I 
renamed the SyncOneBuffer "skip_pinned" flag to "skip_recently_used".  It seems 
natural that something currently pinned would also be considered recently 
used, whereas I didn't find the current naming so obvious.
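
(A simplified sketch of the test that rename is about.  The struct and
names are invented; the real SyncOneBuffer of course works on locked
shared buffer headers rather than a plain struct like this.)

typedef struct
{
    int refcount;       /* current pins */
    int usage_count;    /* clock-sweep recency counter */
    int is_dirty;
} BufferState;

static int
should_write_buffer(const BufferState *buf, int skip_recently_used)
{
    if (!buf->is_dirty)
        return 0;                       /* nothing to write */

    /* "Recently used" covers both pinned buffers and nonzero usage_count. */
    if (skip_recently_used &&
        (buf->refcount > 0 || buf->usage_count > 0))
        return 0;                       /* leave it for a later pass */

    return 1;
}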

I'm also collecting clean vs. dirty usage histogram counts now, since you 
suggested it.  Nothing exciting to report there so far; I may note 
something interesting after I collect more data.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Log levels for checkpoint/bgwriter monitoring

From
Bruce Momjian
Date:
Is this patch ready?

---------------------------------------------------------------------------

Greg Smith wrote:
> I have a WIP patch that adds the main detail I have found I need to 
> properly tune checkpoint and background writer activity.  I think it's 
> almost ready to submit (you can see the current patch against 8.2 at 
> http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
> after making it a bit more human-readable.  But I've realized that along 
> with that, I need some guidance in regards to what log level is 
> appropriate for this information.
> 
> An example works better than explaining what the patch does:
> 
> 2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote 
> checkpoint_segments)
> 2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
> 2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
> 2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
> 2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
> 2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034 
> write=279956 us sync=17739974 us
> 
> Remember that "Load distributed checkpoint" discussion back in December? 
> You can see exactly how bad the problem is on your system with this log 
> style (this is from a pgbench run where it's positively awful--it really 
> does take over 17 seconds for the fsync to execute, and there are clients 
> that are hung the whole time waiting for it).
> 
> I also instrumented the background writer.  You get messages like this:
> 
> 2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5 
> Unscanned = 23/54
> 
> This shows that we wrote (5) the maximum pages we were allowed to write 
> (5) while failing to scan almost half (23) of the buffers we meant to look 
> at (54).  By taking a look at this output while the system is under load, 
> I found I was able to do bgwriter optimization that used to take me days 
> of frustrating testing in hours.  I've been waiting for a good guide to 
> bgwriter tuning since 8.1 came out.  Once you have this, combined with 
> knowing how many buffers were dirty at checkpoint time because the 
> bgwriter didn't get to them in time (the number you want to minimize), the 
> tuning guide practically writes itself.
> 
> So my question is...what log level should all this go at?  Right now, I 
> have the background writer stuff adjusting its level dynamically based on 
> what happened; it logs at DEBUG2 if it hits the write limit (which should 
> be a rare event once you're tuned properly), DEBUG3 for writes that 
> scanned everything they were supposed to, and DEBUG4 if it scanned but 
> didn't find anything to write.  The source of checkpoint information logs 
> at DEBUG1, the fsync/write info at DEBUG2.
> 
> I'd like to move some of these up.  On my system, I even have many of the 
> checkpoint messages logged at INFO (the source of the checkpoint and the 
> total write time line).  It's a bit chatty, but when you get some weird 
> system pause issue it makes it easy to figure out if checkpoints were to 
> blame.  Given how useful I feel some of these messages are to system 
> tuning, and to explaining what currently appears as inexplicable pauses, I 
> don't want them to be buried at DEBUG levels where people are unlikely to 
> ever see them (I think some people may be concerned about turning on 
> things labeled DEBUG at all).  I am aware that I am too deep into this to 
> have an unbiased opinion at this point though, which is why I ask for 
> feedback on how to proceed here.
> 
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
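
(The write=.../sync=... figures in the checkpoint log line quoted above
amount to timing the write and fsync phases separately.  A minimal
standalone sketch of that measurement follows; it is not the actual patch
code, just the shape of the arithmetic.)

#include <stdio.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long
elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    return (stop->tv_sec - start->tv_sec) * 1000000L +
           (stop->tv_usec - start->tv_usec);
}

int
main(void)
{
    struct timeval t0, t1, t2;
    int dirty = 0;

    gettimeofday(&t0, NULL);
    /* ... write out each dirty buffer here, counting them in "dirty" ... */
    gettimeofday(&t1, NULL);
    /* ... fsync each file touched by the checkpoint here ... */
    gettimeofday(&t2, NULL);

    printf("checkpoint buffer flush dirty=%d write=%ld us sync=%ld us\n",
           dirty, elapsed_usec(&t0, &t1), elapsed_usec(&t1, &t2));
    return 0;
}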


Re: Log levels for checkpoint/bgwriter monitoring

From
Magnus Hagander
Date:
Would not at least some of these numbers be better presented through the
stats collector, so they can be easily monitored?

That goes along the lines of my way-way-way-away-from-finished attempt
earlier; perhaps a combination of these two patches?

//Magnus


Bruce Momjian wrote:
> Is this patch ready?
> 
> ---------------------------------------------------------------------------
> 
> Greg Smith wrote:
> >> [...]
> 



Re: Log levels for checkpoint/bgwriter monitoring

From
Bruce Momjian
Date:
Magnus Hagander wrote:
> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
> 
> That goes along the line of my way way way away from finished attempt
> earlier, perhaps a combination of these two patches?

Yes.

---------------------------------------------------------------------------


> 
> //Magnus
> 
> 
> Bruce Momjian wrote:
> > Is this patch ready?
> > 
> > ---------------------------------------------------------------------------
> > 
> > Greg Smith wrote:
> > >> [...]
> > 
> 

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Log levels for checkpoint/bgwriter monitoring

From
Greg Smith
Date:
On Tue, 27 Mar 2007, Magnus Hagander wrote:

> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
> That goes along the line of my way way way away from finished attempt
> earlier, perhaps a combination of these two patches?

When I saw your patch recently, I thought to myself "hmmm, the data 
collected here sure looks familiar"--you even made some of the exact same 
code changes I did.  I've been bogged down recently chasing a performance 
issue that, as it turned out, was mainly caused by the "high CPU usage for 
stats collector" bug.  That caused the background writer to slow to a 
crawl under heavy load, which is why I was having all these checkpoint and 
writer issues that got me monitoring that code in the first place.

With that seemingly resolved, I have a slightly new plan now.  Next I want 
to take the data I've been collecting in my patch, bundle the most important 
parts of it into messages sent to the stats collector (the way it was 
suggested you rewrite your patch), then submit the result.  I've got the log 
files down and have a pretty good idea what data should be collected, but as 
this would be my first time adding stats I'd certainly love some help with that.
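
(As a rough illustration of the kind of message involved; the struct and
field names below are hypothetical and not an existing pgstat message
type.  The idea is simply to ship per-checkpoint and per-round bgwriter
counters to the stats collector instead of, or in addition to, writing
them to the log.)

typedef struct
{
    /* checkpoint side */
    long checkpoints_timed;      /* triggered by checkpoint_timeout */
    long checkpoints_xlog;       /* triggered by checkpoint_segments */
    long buffers_checkpoint;     /* buffers written at checkpoint time */
    long checkpoint_write_usec;  /* time spent in write() calls */
    long checkpoint_sync_usec;   /* time spent in fsync() calls */

    /* background writer side */
    long buffers_clean;          /* written by the LRU scan */
    long buffers_all;            /* written by the all scan */
    long maxwritten_clean;       /* rounds that hit the write limit */
} BgWriterStatsMsg;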

Once that monitoring infrastructure is in place, I then plan to merge 
Itagaki's "Load distributed checkpoint" patch (it touches a lot of the 
same code) and test that out under heavy load.  I think it gives a much 
better context in which to evaluate that patch if, rather than measuring just 
its gross results, you can say something like "with the patch in place, the 
average fsync time on my system dropped from 3 seconds to 1.2 seconds when 
writing out more than 100MB at checkpoint time".  That's the direct cause 
of the biggest problem in that area of code, so why not stare right at it 
rather than measure it indirectly.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD