Re: Limit of bgwriter_lru_maxpages of max. 1000? - Mailing list pgsql-general

From Gerhard Wiesinger
Subject Re: Limit of bgwriter_lru_maxpages of max. 1000?
Date
Msg-id alpine.LFD.2.00.0910051928460.15050@bbs.intern
Whole thread Raw
In response to Re: Limit of bgwriter_lru_maxpages of max. 1000?  (Greg Smith <gsmith@gregsmith.com>)
Responses Re: Limit of bgwriter_lru_maxpages of max. 1000?
List pgsql-general
On Mon, 5 Oct 2009, Greg Smith wrote:

> On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:
>
>>> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>>>
>>>> I found that lowering checkpoint completion target was what helped.
>>>> Does that seem counter-intuitive to you?
>>>
>>
>> I set it to 0.0 now.
>
> If you set that to 0.0, the whole checkpoing spreading logic doesn't apply
> like it's supposed to.  I'm not sure what the results you posted mean now. If
> you had it set to 0 and saw a bad spike (which is how I read your message),
> I'd say "yes, that's what happens when you do reduce that parameter, so don't
> do that".  If you meant something else please clarify.

I think the problem is, that it is done on checkpoint time (whether spread
or not). I should have been already be done by bgwriter.

>
> Thanks for the dtrace example, I suggested we add those checkpoint probes in
> there and someone did, but I hadn't seen anybody use them for anything yet.
>

I think more probes (e.g. on different writing conditions like writing
from bgwriter or on a checkpoint) would be interesting here.

>> Bug1: usage_count is IHMO not consistent
>
> It's a bit hack-ish, but the changes made to support multiple buffer use
> strategies introduced by the "Make large sequential scans and VACUUMs work in
> a limited-size ring" commit are reasonable even if they're not as consistent
> as we'd like.  Those changes were supported by benchmarks proving their
> utility, which always trump theoretical "that shouldn't work better!" claims
> when profiling performance.
>
> Also, they make sense to me, but I've spent a lot of time staring at
> pg_buffercache output to get a feel for what shows up in there under various
> circumstances.  That's where I'd suggest you go if this doesn't seem right to
> you; run some real database tests and use pg_buffercache to see what's inside
> the cache when you're done.  What's in there and what I expected to be in
> there weren't always the same thing, and it's interesting to note how that
> changes as shared_buffers increases.  I consider some time studying that a
> pre-requisite to analyzing performance of this code.
>

I have analyzed pg_buffercache (query every second, see below) in parallel
to see what happens but I didn't see expected results in some ways with
the usage_counts. Therefore I analyzed the code and found IHMO the problem
with the usage_count and buffer reallocation. Since the code change is
also new (I think it way 05/2009) it might be that you tested before ...

BTW: Is it possible to get everything in pg_class over all databases as
admin?

>> Bug2: Double iteration of buffers
>> As you can seen in the calling tree below there is double iteration with
>> buffers involved. This might be a major performance bottleneck.
>
> Hmmm, this might be a real bug causing scans through the buffer cache to go
> twice as fast as intended.

That's not twice O(2*n)=O(n) that's a factor n*n (outer and inner loop
iteration) which means overall is O(n^2) which is IHMO too much.

>  Since the part you suggest is doubled isn't very
> intensive or called all that often, there's no way it can be a major issue
> though.

It is a major issue since it is O(n^2) and not O(n). E.g. with 2GB share
buffer we have 262144 blocks and 68719476736 calls which is far too much.

>  That's based on knowing what the code does and how much it was
> called, as well as some confidence that if it were really a *major* problem,
> it would have shown up on the extensive benchmarks done on all the code paths
> you're investigating.
>

The problem might be hidden for the following reasons:
1.) Buffers values are too low that even n^2 is low for today's machines
2.) Code is not often called in that way
3.) backend writes out pages so that the code is never executed
4.) ...

>> BTW: Are there some tests available how fast a buffer cache hit is and a
>> disk cache hit is (not in the buffer cache but in the disk cache)? I'll
>> asked, because a lot of locking is involved in the code.
>
> I did some once but didn't find anything particularly interesting about the
> results.  Since you seem to be on a research tear here, it would be helpful
> to have a script to test that out available, I wasn't able to release mine
> and something dtrace based would probably be better than the approach I used
> (I threw a bunch of gettimeofdata calls into the logs and post-processed them
> with a script).
>

Do you have an where one should set tracepoints inside and outside
PostgreSQL?

>> BTW2: Oracle buffercache and background writer strategy is also
>> interesting.
>
> As a rule, we don't post links to other database implementation details here,
> as those can have patented design details we'd prefer not to intentionally
> re-implement.  Much of Oracle's design here doesn't apply here anyway, as it
> was done in the era when all of their writes were synchronous.  That required
> them to worry about doing a good job on some things in their background
> writer that we shrug off and let os writes combined with fsync handle
> instead.
>

Ok, no problem.

Ciao,
Gerhard

--
http://www.wiesinger.com/


SELECT
   CASE
     WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
     ELSE datname
   END AS database,
   CASE
     WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
     ELSE spcname
   END AS tablespace,
   CASE
     WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relname
   END AS relation,
   CASE
     WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relkind
   END AS relkind,
   usagecount > 0 AS usagecount_gt_0,
   isdirty,
   MIN(relblocknumber) AS min_blocknumber,
   MAX(relblocknumber) AS max_blocknumber,
   ROUND(AVG(relblocknumber),2) AS avg_blocknumber,
   ROUND(STDDEV(relblocknumber),2) AS stddev_blocknumber,
   COUNT(*) AS count
FROM
   pg_buffercache
LEFT JOIN pg_class      ON pg_buffercache.relfilenode = pg_class.oid
LEFT JOIN pg_tablespace ON pg_buffercache.reltablespace =
pg_tablespace.oid
LEFT JOIN pg_database   ON pg_buffercache.reldatabase = pg_database.oid
WHERE
   isdirty = true
GROUP BY
   CASE
     WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
     ELSE datname
   END,
   CASE
     WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
     ELSE spcname
   END,
   CASE
     WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relkind
   END,
   CASE
     WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
     ELSE relname
   END,
   usagecount > 0,
   isdirty
--HAVING relkind = 'r'
ORDER BY
   database,
   tablespace,
   relkind DESC,
   relation,
   usagecount > 0,
   isdirty
;

pgsql-general by date:

Previous
From: John R Pierce
Date:
Subject: Re:
Next
From: Guy Rouillier
Date:
Subject: Re: How useful is the money datatype?